J@ArangoDB

{ "subject" : "ArangoDB", "tags": [ "multi-model", "nosql", "database" ] }

On Getting Unique Values

While paging through the issues in the ArangoDB issue tracker I came across issue #987, titled Trying to get distinct document attribute values from a large collection fails.

The issue was opened around 10 months ago, when ArangoDB 2.2 was the current release. We have improved AQL performance somewhat since then, so I was eager to see how the query would perform in ArangoDB 2.6, especially compared to 2.2.
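To give an idea of the kind of query involved: getting distinct attribute values is typically done with a COLLECT, for example via the HTTP cursor API. This is only a minimal sketch; the collection name docs and the attribute value are made up for illustration, the original issue used its own data:

# collect the distinct values of attribute "value" in collection "docs"
curl -s -X POST http://localhost:8529/_api/cursor \
  -d '{ "query": "FOR d IN docs COLLECT v = d.value RETURN v" }'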

The Great AQL Shootout: ArangoDB 2.5 vs 2.6

We are currently preparing ArangoDB 2.6 for release. A lot of work has been put into this release, and I really hope we can ship a first 2.6 release soon.

To keep you going in the meantime, I have put together some performance test results for 2.6. The tests I ran compared AQL query execution times in 2.6 and 2.5.

The results look quite promising: 2.6 outperformed 2.5 for all tested queries, mostly by factors of 2 to 5. A few specific AQL features in the tests got boosted even more, resulting in query execution time reductions of 90 % and more. Finally, the tests also revealed one particular case for which 2.6 provides a several-hundredfold speedup.

More good news: not a single one of the test queries ran slower in 2.6 than in 2.5.

Less Intrusive Linking

A while ago, our continuous integration builds on TravisCI began to fail seemingly at random: the build workers were killed without an apparent reason. The build process was evidently hitting some resource limit, though we couldn’t find any documented limit that the builds actually violated.

Some builds still succeeded without issues, but those builds that were killed had one thing in common: they were all stuck waiting for the linker to finish.

The default linker used on TravisCI is GNU ld. After some research, it turned out that replacing GNU ld with GNU gold not only made the linking much faster, but also less resource-intensive. Linking ArangoDB on my local machine is almost twice as fast with gold as with ld. Even better, after reconfiguring our TravisCI builds to also use gold, our builds weren’t killed anymore by TravisCI’s build scheduling system.

To make TravisCI use gold instead of ld, add the following to your project’s .travis.yml in the install section (so it gets executed before the actual build steps):

commands for wrapping gold
# install the gold linker
sudo apt-get -y install binutils-gold
# create a tiny wrapper script named "ld" that simply invokes gold
mkdir -p ~/bin/gold
echo '#!/bin/bash' > ~/bin/gold/ld
echo 'gold "$@"' >> ~/bin/gold/ld
chmod a+x ~/bin/gold/ld
# make the compiler look for helper tools (and thus "ld") in the wrapper directory
export CFLAGS="-B$HOME/bin/gold $CFLAGS"
export CXXFLAGS="-B$HOME/bin/gold $CXXFLAGS"

The script downloads and installs gold and creates a tiny wrapper script in a file named ld in the user’s home directory. The wrapper simply calls gold with all the arguments passed to it. Finally, the script modifies the environment variables CFLAGS and CXXFLAGS, adding a -B parameter that points to the wrapper script’s directory.

-B is the option for the compiler’s search path. The compiler (at least g++) will look in this path for any helper tools it invokes. As we have a file named ld in this directory, g++ will use our wrapper script instead of the original ld binary. This way we can keep the original version of ld in /usr/bin and only override it using environment variables. This is also helpful in other contexts, e.g. when ld shall remain the system’s default linker but gold shall only be used for linking a few selected components.
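To check that the compiler really picks up the wrapper, one can ask it which ld it would invoke; with -B pointing at the wrapper directory, this should print the wrapper’s path:

# should print something like /home/<user>/bin/gold/ld
g++ -B$HOME/bin/gold -print-prog-name=ld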

Bulk Document Lookups

ArangoDB 2.6 comes with a specialized API for bulk document lookups.

The new API allows fetching multiple documents from the server with a single request, making bulk document retrieval more efficient than issuing one request per document.
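As a rough sketch of how this looks from the shell, assuming the lookup-by-keys endpoint that 2.6 adds for this purpose, and a made-up collection named users:

# fetch several documents by key in a single request (collection and keys are made up)
curl -s -X PUT http://localhost:8529/_api/simple/lookup-by-keys \
  -d '{ "collection": "users", "keys": [ "john", "jane", "jim" ] }'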

IN-list Improvements

We have worked on many AQL optimizations for ArangoDB 2.6.

As a side effect of one of these optimizations, some cases involving the handling of large IN-lists have become much faster than before. Large IN-lists are normally used when comparing attribute or index values against some big array of lookup values or keys provided by the application.
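Such a query typically passes the lookup values as a bind parameter. A minimal sketch via the HTTP cursor API, with made-up collection and attribute names:

# filter on a large list of lookup values passed as bind parameter "values"
curl -s -X POST http://localhost:8529/_api/cursor \
  -d '{ "query": "FOR d IN docs FILTER d.value IN @values RETURN d", "bindVars": { "values": [ 1, 2, 3, 42, 99 ] } }'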

Fulltext Index Enhancements

This post is about improvements for the fulltext index in ArangoDB 2.6. The improvements address the problem that non-string attributes were ignored when fulltext-indexing.

Effectively, this prevented string values inside arrays or objects from being indexed. Though this behavior was documented, it limited the usefulness of the fulltext index considerably. Several users requested that the fulltext index be able to index arrays and object attributes, too.

This has finally been accomplished: the fulltext index in 2.6 supports indexing arrays and objects!
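A minimal sketch of what becomes possible, with made-up collection and attribute names (index creation via the HTTP index API, querying via the AQL FULLTEXT function):

# create a fulltext index on attribute "texts" of collection "posts"
curl -s -X POST "http://localhost:8529/_api/index?collection=posts" \
  -d '{ "type": "fulltext", "fields": [ "texts" ] }'

# documents whose "texts" attribute is an array of strings can now be found, too
curl -s -X POST http://localhost:8529/_api/cursor \
  -d '{ "query": "FOR d IN FULLTEXT(posts, \"texts\", \"database\") RETURN d" }'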

Subquery Optimizations

This is another post demonstrating some of the AQL query performance improvements that can be expected in ArangoDB 2.6. Specifically, this post is about an optimization for subqueries. AQL queries with multiple subqueries will likely benefit from it.
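The kind of query that should benefit runs several subqueries per outer iteration, roughly like the following sketch (all collection and attribute names are made up):

# a query with two subqueries executed for each document in "users"
curl -s -X POST http://localhost:8529/_api/cursor \
  -d '{ "query": "FOR u IN users RETURN { user: u, orders: (FOR o IN orders FILTER o.user == u._key RETURN o), logins: (FOR l IN logins FILTER l.user == u._key RETURN l) }" }'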

Return Value Optimization for AQL

While searching for further AQL query optimizations last week, we found that intermediate AQL query results were copied one time too often in some cases. Specifically, the data that a query’s ReturnNode returns to the caller was copied into the ReturnNode’s own register. Since ReturnNodes never modify their input data, this called for what is known in compilers as return-value optimization.

2.6 will now optimize away these copies in many cases, and this post shows which performance benefits can be expected due to the optimization.

Exporting Data for Offline Processing

A few weeks ago I wrote about ArangoDB’s specialized export API.

The export API is useful when the goal is to extract all documents from a given collection and to process them outside of ArangoDB.

The export API can provide quick and memory-efficient snapshots of the data in the underlying collection, making it suitable for extracting all documents of a collection. It can provide the data much faster than an AQL query that extracts all documents.
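As a rough sketch, the initial export request looks about like this, assuming the /_api/export endpoint and a made-up collection name; the response contains a first batch of documents plus a cursor id for fetching the remaining batches:

# start exporting all documents of collection "mycollection"
curl -s -X POST "http://localhost:8529/_api/export?collection=mycollection" \
  -d '{ "flush": true, "batchSize": 1000 }'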

In this post I’ll show how to use the export API to extract data and process it with PHP.