J@ArangoDB

{ "subject" : "ArangoDB", "tags": [ "multi-model", "nosql", "database" ] }

IN-list Improvements

We have worked on many AQL optimizations for ArangoDB 2.6.

As a side effect of one of these optimizations, some cases involving the handling of large IN-lists have become much faster than before. Large IN-lists are normally used when comparing attribute or index values against some big array of lookup values or keys provided by the application.

Let’s quickly create and populate a collection named keys so that we can use some IN-list queries on it later on:

setting up example data
1
2
3
4
5
6
db._create("keys");

// insert 100k documents with some defined keys into the collection
for (var i = 0; i < 100000; ++i) {
  db.keys.insert({ _key: "test" + i });
}

And here is a query to all find documents with one of the provided keys test0 to test999. The IN-list here contains 1,000 values:

using an IN-list with 1,000 values
1
2
3
4
5
6
var keys = [ ];
var n = 1000;
for (var i = 0; i < n; ++i) {
  keys.push("test" + i);
}
db._query("FOR doc IN keys FILTER doc._key IN @keys RETURN doc", { keys: keys }); });

When invoked from the ArangoShell, this takes around 0.6 seconds to complete with ArangoDB 2.5.

Increasing the length of the IN-list from 1,000 to 5,000 values makes this run in around 15 seconds. With an IN-list of 10,000 values, this already takes more than 60 seconds to complete in 2.5.

Obviously longer IN-lists weren’t handled terribly efficiently in 2.5, and should be avoided there if possible.

I am glad this has been fixed in 2.6. Following is a comparison of the above query for different IN-list sizes, run on both ArangoDB 2.5 and 2.6.

2.5 and 2.6 with different IN-list sizes
1
2
3
4
5
6
7
# of IN-list values    Execution time (2.5)   Execution time (2.6)
------------------------------------------------------------------
              1,000                  0.67 s                 0.03 s
              5,000                 15.34 s                 0.12 s
             10,000                 63.48 s                 0.20 s
             50,000                   n/a                   0.81 s
            100,000                   n/a                   1.60 s

Looks like 2.6 handles longer IN-lists way better than 2.5! The above figures suggest that execution times now scale about linearly with the number of IN-list values. This also leads to reductions in query execution times of 90 % and more percent.

Please note that longer IN-lists will still make a the query run longer than when using shorter IN-lists. This is expected because longer IN-lists require more comparisons to be made and will lead (in the above example) to more documents being returned.