J@ArangoDB

{ "subject" : "ArangoDB", "tags": [ "multi-model", "nosql", "database" ] }

Using Custom Visitors in AQL Graph Traversals

Note: this post is about the ArangoDB 2.x series

This post is about some recent extensions for the AQL graph traversal functionality in ArangoDB.

These extensions allow invoking user-defined JavaScript code for filtering and results generation in AQL queries that contain traversals.

This should make AQL graph traversals much more powerful than before. Additionally, AQL graph traversals get more configurable, allowing to write traversal functions with control-flow logic and complex filtering. As a side-effect, this change facilitates writing specialized traversal functions with much higher efficiency than the general-purpose, cover-all-cases default ones.

The extensions are currently available in the devel branch of ArangoDB on in the 2.4 branch (with 2.4.2 being the first release to include them).

Example graph

For all following examples, I’ll be using a small example graph that can be set up by running this script from the ArangoShell.

I have chosen this small graph because it is easy to understand and still complex enough to demonstrate some common traversal use cases.

The example graph consists of the following two collections:

  • v: a collection with vertex documents
  • e: an edge collection containing the connections between vertices in v

All vertices in the graph have a type attribute, with types being either root, continent, country or capital. The graph is a tree, so it has only one vertex with type root. The root vertex is named world. Below the root there are only vertices of type continent. These are also connected to some country vertices. Finally, country vertices are also connected to capital vertices:

1
root <--[is in]-- continent <--[is in]-- country <--[is in]-- capital

In the examples, we’ll only look at the vertices and ignore what the connections look like.

Custom visitors

We know the graph is a tree, so let’s print its structure in a textual format using AQL. We’ll employ a custom visitor function for this. A custom visitor is a user-defined callback function that is called for every vertex that is encountered during a graph traversal. Custom visitor functions need to be written in JavaScript and be registered once before they can be used from an AQL query.

Custom visitors have the following function signature:

visitor function signature
1
function (config, result, vertex, path)

The function parameters have the following meanings:

  • config: the traversal configuration
  • result: the result already generated by the traversal. This is important only if the visitor is designed to modify an existing result in-place
  • vertex: the currently visited vertex document
  • path: the path from the start vertex to the currently visited vertex document. The path will contain an array vertices and an array edges

Let’s register a custom visitor named myfunctions::structurePrinter. This can done by running the following code from the ArangoShell:

registering a custom visitor to print the tree structure
1
2
3
4
5
6
7
var aqlfunctions = require("org/arangodb/aql/functions");

aqlfunctions.register("myfunctions::structurePrinter", function (config, result, vertex, path) {
  var indentation = new Array(path.vertices.length).join("  ");
  var label       = "- " + vertex.name + " (" + vertex.type + ")";
  return indentation + label;
});

Processing vertex data with a function

The above function will be called for every vertex in the graph when we use it in a traversal. Let’s do it and run the AQL query to invoke the visitor function.

I suggest running the query from the web interface’s AQL editor:

invoking the custom visitor
1
2
3
4
5
6
LET params = { 
  visitor : "myfunctions::structurePrinter", 
  visitorReturnsResults : true 
}
FOR result IN TRAVERSAL(v, e, "v/world", "inbound", params) 
  RETURN result

The visitor will visit all vertices in the graph, starting at the vertex v/world as specified. It will then follow incoming connections using a depth-first search (this was not specified in the query as it is the default).

As we started with the root vertex of the graph, the query will visit all vertices exactly once. Fortunately the example graph is a tree and does not contain any cycles, so we do not have to care about how to make the traversal terminate. The traversal will automatically terminate after it has visited all nodes.

The AQL query should produce something like this:

query result
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
[
  "- World (root)",
  "  - North America (continent)",
  "    - Bahamas (country)",
  "      - Nassau (capital)",
  "    - Canada (country)",
  "      - Ottawa (capital)",
  "    - Antigua and Barbuda (country)",
  "      - Saint John's (capital)",
  "    - Barbados (country)",
  "      - Bridgetown (capital)",
  "  - Asia (continent)",
  "    - Afghanistan (country)",
  "      - Kabul (capital)",
  ...
]

This should provide a good overview of the graph’s contents.

Referring to elements in the path

To return the above result in a more structured manner, let’s overwrite the previous visitor function with one that returns the most interesting vertex attributes individually. Let’s include one that shows the nesting level of each vertex in the tree:

registering another custom visitor to return the tree structure
1
2
3
4
5
6
7
8
9
var aqlfunctions = require("org/arangodb/aql/functions");

aqlfunctions.register("myfunctions::structurePrinter", function (config, result, vertex, path) {
  return {
    name: vertex.name,
    type: vertex.type,
    level: path.vertices.length
  };
});

Running the same AQL query will now return something like:

query result
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
[
  {
    "name" : "World",
    "type" : "root",
    "level" : 1
  },
  {
    "name" : "North America",
    "type" : "continent",
    "level" : 2
  },
  {
    "name" : "Bahamas",
    "type" : "country",
    "level" : 3
  },
  {
    "name" : "Nassau",
    "type" : "capital",
    "level" : 4
  },
  {
    "name" : "Canada",
    "type" : "country",
    "level" : 3
  },
  {
    "name" : "Ottawa",
    "type" : "capital",
    "level" : 4
  },
  {
    "name" : "Antigua and Barbuda",
    "type" : "country",
    "level" : 3
  },
  {
    "name" : "Saint John's",
    "type" : "capital",
    "level" : 4
  },
  ...
]

Adding control flow

Now let’s add some control flow to the visitor function. The following visitor function will also return information about each vertex’ parent – except for the root vertex, which does not have a parent:

a custom visitor with simple control flow
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
var aqlfunctions = require("org/arangodb/aql/functions");

aqlfunctions.register("myfunctions::structurePrinter", function (config, result, vertex, path) {
  var res = {
    name: vertex.name,
    type: vertex.type,
    level: path.vertices.length
  };
  if (path.vertices.length > 1) {
    res.parent = {
      name: path.vertices[path.vertices.length - 2].name,
      type: path.vertices[path.vertices.length - 2].type
    };
  }
  return res;
});

Running our AQL query will now produce a different type of result for the root vertex than for all the other vertices:

query result
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
[
  {
    "name" : "World",
    "type" : "root",
    "level" : 1
  },
  {
    "name" : "North America",
    "type" : "continent",
    "level" : 2,
    "parent" : {
      "name" : "World",
      "type" : "root"
    }
  },
  {
    "name" : "Bahamas",
    "type" : "country",
    "level" : 3,
    "parent" : {
      "name" : "North America",
      "type" : "continent"
    }
  },
  {
    "name" : "Nassau",
    "type" : "capital",
    "level" : 4,
    "parent" : {
      "name" : "Bahamas",
      "type" : "country"
    }
  },
  ...
]

Of course much more things can be achieved by peeking the path variable.

Filtering

Now let’s try to restrict the results of a graph traversal to just some vertices, for example, all European countries. As we know the structure of the graph is quite simple, the following naive approach will already do:

a custom visitor returning only European country names
1
2
3
4
5
6
7
8
var aqlfunctions = require("org/arangodb/aql/functions");

aqlfunctions.register("myfunctions::structurePrinter", function (config, result, vertex, path) {
  if (path.vertices.length > 1 &&
      path.vertices[path.vertices.length - 2].name === "Europe") {
    return vertex.name;
  }
});

But the above is clearly not ideal.

First of all, the traversal will still visit every vertex in the graph, even though most vertices will not be returned. Ideally, one will want to restrict a traversal to visit as few vertices as possible, especially in a big graph or in a graph that contains cycles.

Second, the above visitor is looking into a vertex’ direct parent for filtering. This will work for graphs that have a rigid structure, but may not work in more complex setups.

We better use a dedicated function for filtering. Such function can control if a given vertex is going to be visited (via calling the visitor function) and if its connections should be followed. It can skip non-interesting vertices early, providing a good way to make traversals more efficient.

A filter function can be specified in the filterVertices attribute of the traversal options. If specified, filterVertices needs to contain the name of a custom AQL function. A filter function again needs to be written in JavaScript and has the following signature:

filter function signature
1
function (config, vertex, path)

The filter function will be called for each vertex. It can return one of the following values:

  • "prune": visit the vertex, but do not descend into its connections
  • "exclude": do not visit the vertex, but do descend into its connections
  • [ "prune", "exclude" ]: do not visit, and do not descend
  • undefined (default): visit and descend

The following filter function will return "exclude" for the root vertex, leading to the visitor not being called for it. However, the traversal will still descend into the connections of the root node.

On the next level, all continents will be enumerated. The filter will return [ "prune", "exclude" ] for all continents but Europe, leading to the visitor not being invoked for these continents, and their connections not being followed. For the Europe vertex, it will return "exclude", meaning the visitor will not be called, but the traversal will descend into the connections of Europe.

For all vertices of type country, the visitor will be called. This is ok because the filter previously prevented the traversal from descending into any other country but Europe.

Finally, the filter will return "prune" for all countries, meaning the traversal will not descend into a country’s connections (in this case that would be the captial vertices). This will make the traversal end at the country level.

Here it is:

registering a filter for European countries
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
var aqlfunctions = require("org/arangodb/aql/functions");

aqlfunctions.register("myfunctions::europeFilter", function (config, vertex, path) {
  if (vertex.type === "country") {
    return "prune";
  }

  if (vertex.type === "continent") {
    if (vertex.name !== "Europe") {
      return [ "prune", "exclude" ];
    }
  }

  return "exclude";
});

Putting the logic into the filter function allows using a very simple visitor:

a very simple visitor
1
2
3
4
5
var aqlfunctions = require("org/arangodb/aql/functions");

aqlfunctions.register("myfunctions::structurePrinter", function (config, result, vertex) {
  return vertex.name;
});

We must also slightly extend our AQL query and tell it to use our custom filter function:

invoking the custom visitor and the custom filter
1
2
3
4
5
6
7
LET params = { 
  filterVertices : "myfunctions::europeFilter",
  visitor : "myfunctions::structurePrinter", 
  visitorReturnsResults : true 
}
FOR result IN TRAVERSAL(v, e, "v/world", "inbound", params) 
  RETURN result

Running the adjusted AQL query should produce something like:

query results
1
2
3
4
5
6
7
8
[
  "Austria",
  "Croatia",
  "Bosnia and Herzegovina",
  "Andorra",
  "Bulgaria",
  ...
]

Custom filter functions are a very general approach. Obviously much more complex tasks than shown here can be achieved with them. We have often been asked for ways to set up complex filter conditions in AQL traversals, and I hope these custom filter functions will cover most of that.

Special visitors

Traversal depths can be controlled with the general configuration parameters minDepth and maxDepth. These parameters are helpful to make the traversal only include vertices occurring after the specified distance from the start vertex, or up to the specified distance away from the start vertex. This helps bounding traversals, but is not flexible enough when handling graphs with very distinct path lengths.

Returning only leaf nodes

For example, finding leaf nodes in a graph is quite hard using a default traversal. The filterVertices function cannot be used to find leaf nodes, because filterVertices is called before a vertex’ connections are determined. The same is true for visitor functions. There were not provided any information about whether the currently visited vertex has connections or not. All a visitor could previously do to find leaf nodes is to return each visited vertex along with the full path information. Some post-processing of the traversal result with regular AQL was then required to detect the leaf nodes in that result.

This could easily get inefficient, especially in a big graphs for which the intermediate results created by the default traversal visitor grew beyond reasonable sizes.

We therefore added a mechanism that can pass information about the vertex’ connections to the visitor. This allows writing new types of visitor functions. For example, it makes it easy to write visitors that can return only leaf nodes.

In order to have the traversal pass the currently visited vertex’ connections to the visitor function, the traversal parameter order must be set to a value of "preorder-expander". The traversal’s visitor function will then be called with an additional fifth parameter named connected, which is an array of the connections of the current vertex. This array will be empty if the traversal’s expander function did not find any connections for the vertex.

Here’s a simple visitor that will make a traversal return only all leaf nodes:

a visitor that receives information about connections, too
1
2
3
4
5
6
7
var aqlfunctions = require("org/arangodb/aql/functions");

aqlfunctions.register("myfunctions::leafNodeVisitor", function (config, result, vertex, path, connected) {
  if (connected && connected.length === 0) {
    return vertex.name + " (" + vertex.type + ")";
  }
});

And here’s an AQL query that shows how to use this type of visitor:

invoking the leaf node visitor
1
2
3
4
5
6
7
LET params = { 
  order : "preorder-expander",
  visitor : "myfunctions::leafNodeVisitor", 
  visitorReturnsResults : true 
}
FOR result IN TRAVERSAL(v, e, "v/world", "inbound", params) 
  RETURN result

As a result, the above query will only print vertices of type capital (because those are the only leaf nodes in the graph).

The nice thing when looking at the custom visitor function is that it only filters on the number of connections, but not on vertex type or anything else specific for this type of graph.

So it seems like the above function is general purpose and can be reused for other graphs, too.

Counting vertices

Let’s say we wanted to count the number of vertices in the graph, or the number of vertices that passed our filterVertices function.

Counting globally

This is easy to achieve with a custom visitor like this:

registering a visitor that counts the number of vertices visited
1
2
3
4
5
6
7
8
var aqlfunctions = require("org/arangodb/aql/functions");

aqlfunctions.register("myfunctions::vertexCounter", function (config, result) {
  if (result.length === 0) {
    result.push(0);
  }
  result[0]++;
});

Note that the above visitor function does not return anything, but will modify an existing result in place. As a result, it will produce one counter value, which will be increased whenever the visitor is called for a vertex.

To invoke this visitor and retrieve the count value, we have to set the visitorReturnsResults attribute in the AQL query to false. This will make the traversal code pass the existing result into the visitor and does not expect it to return any results via a return instruction. Here’s how to run this visitor:

invoking the global vertex counter
1
2
3
4
5
6
LET params = { 
  visitor : "myfunctions::vertexCounter", 
  visitorReturnsResults : false 
}
FOR result IN TRAVERSAL(v, e, "v/world", "inbound", params) 
  RETURN result

The result will simply be:

query result
1
2
3
[
  87
]

Counting by type

Let’s say we wanted to count vertices by type. This is similar, except that now one global counter value is insufficient and we instead need an object to keep track of the different counters. We can still get away with modifying an existing result in place:

registering a visitor that counts vertices by type
1
2
3
4
5
6
7
8
9
10
11
12
13
14
var aqlfunctions = require("org/arangodb/aql/functions");

aqlfunctions.register("myfunctions::vertexCounter", function (config, result, vertex) {
  if (result.length === 0) {
    result.push({ });
  }
  var vertexType = vertex.type;
  if (! result[0].hasOwnProperty(vertexType)) {
    result[0][vertexType] = 1;
  }
  else {
    result[0][vertexType]++;
  }
});

Our invocation AQL query does not change. The result of this visitor for the example graph will be:

query result
1
2
3
4
5
6
7
8
[
  {
    "root" : 1,
    "continent" : 6,
    "country" : 40,
    "capital" : 40
  }
]

Of course such visitors can be combined with custom filters.

You probably ask why writing custom code is required to achieve a simple task like counting. The main reason for this is that the traversal functionality is very general purpose and is not optimized for a specific use case like just counting. For example, the default traversal visitor will copy the complete vertex and path information into the result.

This can produce very big intermediate results if the graph is big or vertices contain lots of data. If all we want is to count the number of vertices globally or per type, we are better off with something more specialized.

The good news is that the most simple use case “count all vertices” there is a predefined visitor named _AQL::COUNTINGVISITOR that can directly be used from a query, without prior registration of a custom function:

using the predefined countingvisitor function
1
2
3
4
5
LET params = { 
  visitor : "_AQL::COUNTINGVISITOR"
}
FOR result IN TRAVERSAL(v, e, "v/world", "inbound", params) 
  RETURN result

Accessing components from the path

Visitors are allowed to peek into the paths variable to check how the currently visited vertex is linked to others.

The paths variable is an object with a vertices sub-attribute and an edges sub-attribute. vertices is an array including all vertices in the path from the start vertex up to the currently visited vertex. The currently visited vertex is included in this array. edges is an array including all connections (edges) between the start vertex and the currently visited vertex. This array might be empty, in case the visitor is called for the start vertex.

The following visitor function demonstrates how to peek into paths: it will produce a stringified version of the path for all leaf vertices:

registering a visitor that accesses components from the path
1
2
3
4
5
6
7
8
9
10
11
12
13
14
var aqlfunctions = require("org/arangodb/aql/functions");

aqlfunctions.register("myfunctions::leafNodePathVisitor", function (config, result, vertex, path, connected) {
  if (connected && connected.length === 0) {
    var res = "";
    path.vertices.forEach(function(v, i) {
      if (i > 0 && i <= path.edges.length) {
        res += " <--[" + path.edges[i - 1].type + "]-- ";
      }
      res += v.name;
    });
    return res;
  }
});

It can be invoked as follows:

using the visitor that accesses path components
1
2
3
4
5
6
7
LET params = { 
  order : "preorder-expander",
  visitor : "myfunctions::leafNodePathVisitor", 
  visitorReturnsResults : true 
}
FOR result IN TRAVERSAL(v, e, "v/world", "inbound", params) 
  RETURN result

It will return something like the following:

query result
1
2
3
4
5
6
7
8
9
10
[
  "World <--[is-in]-- Africa <--[is-in]-- Algeria <--[is-in]-- Algiers",
  "World <--[is-in]-- Africa <--[is-in]-- Angola <--[is-in]-- Luanda",
  "World <--[is-in]-- Africa <--[is-in]-- Botswana <--[is-in]-- Gaborone",
  "World <--[is-in]-- Africa <--[is-in]-- Burkina Faso <--[is-in]-- Ouagadougou",
  "World <--[is-in]-- Africa <--[is-in]-- Burundi <--[is-in]-- Bujumbura",
  "World <--[is-in]-- Africa <--[is-in]-- Cameroon <--[is-in]-- Yaounde",
  "World <--[is-in]-- Africa <--[is-in]-- Chad <--[is-in]-- N'Djamena",
  ...
]

Note: the contents of the path variable will change between calls to the visitor function. Therefore is it not safe to reference arrays or objects from path in the result for visitors that modify the result variable in place (i.e. when visitorReturnsResults is set to false). The safe way to put path components into the result of such visitors is to clone the parts of the path before putting them into result.

Passing parameters into visitors and filters

It is often useful to pass own parameters into function to provide some sort of invocation context. For example, the purpose of the following visitor function is to return an object with only certain attributes of each visited vertex:

a visitor that can return arbitrary vertex attributes:
1
2
3
4
5
6
7
8
9
10
11
var aqlfunctions = require("org/arangodb/aql/functions");

aqlfunctions.register("myfunctions::attributesPrinter", function (config, result, vertex) {
  var values = { };
  if (typeof config.data === "object" && Array.isArray(config.data.attributes)) {
    config.data.attributes.forEach(function (attribute) {
      values[attribute] = vertex[attribute];
    });
  }
  return values;
});

Which attributes the function will return can be configured by passing in an array of attribute names in the config parameter’s data sub-attribute. Here’s an AQL query that will configure the visitor to return _id and type:

invoking the attributes visitor
1
2
3
4
5
6
7
8
9
LET params = { 
  visitor : "myfunctions::attributesPrinter",
  visitorReturnsResults : true, 
  data: { 
    attributes: [ "_id", "type", "name" ] 
  }  
}
FOR result IN TRAVERSAL(v, e, "v/world", "inbound", params) 
  RETURN result