J@ArangoDB

{ "subject" : "ArangoDB", "tags": [ "multi-model", "nosql", "database" ] }

Understanding Where Operations Are Executed

I recently had to deal with some data processing operation that took about 20 minutes to complete. When looking into this, I found that the easiest and most beneficial change to the whole setup was to make the operation a server-side operation instead of executing it client-side.

This change reduced the operation’s total execution time to a few seconds.

I can’t show the original processing task here, so I’ll start with a contrived example. Imagine the following for loop inserting 100K documents into a collection named test:

inserting 100k documents
1
2
3
for (i = 0; i < 100000; ++i) {
  db.test.save({ value: i });
}

Now we only need a client application to execute the operation. As I don’t have a presentable client application right now, I will use the ArangoShell as my client application.

What’s in a for loop?

Running the above for loop inside the ArangoShell will lead to the loop being executed inside the arangosh process.

In order to save a document in the collection, arangosh (our client) must make a call to the ArangoDB server. This means issuing an HTTP POST request to the server’s REST API at /_api/document/?collection=test. The server process will receive this request, insert the document, and respond with an HTTP status code 201 or 202 to our client. The client will then continue the loop until all documents have been inserted.

Now it’s easy to see that the simple 3-line loop will issue 100,000 HTTP requests in total. This means lots of data being pushed through the network stack(s). It is pretty easy to imagine that this will come at a cost.

If we instead execute the above loop directly inside the ArangoDB server, we can get rid of all the network overhead. The server has no need to send HTTP calls to itself. It can simply execute the 100K inserts and is then done. We therefore assume the loop to run somewhat faster when executed server-side.

A quick test on a crap laptop produced the following execution times for running the loops:

  • server-side execution (arangod): 1.34 seconds
  • client-side execution (arangosh): 17.32 seconds

Ouch. It looks like the client-server request-response overhead matters.

The following sections deal with how to get rid of some or even all the client-server ping pong.

Graph traversals

The above for loop example was contrived, but imagine running a client-side graph traversal instead. In fact, the original problem mentioned in the introduction has been a graph traversal.

The problem of a graph traversal is that is often iterative and highly dynamic. Decisions are made during the traversal as nodes are encountered, leading to dynamic inclusion or exclusion etc. This means that it makes sense to process nodes and edges only when needed, at the point when they are visited.

Even if the client can employ some sort of caching for already visited nodes, the client still needs to ask the server about each visited node’s connections at least once. Otherwise it could not follow them.

This normally means lots of requests and responses. Compare this to the single request-response alternative in which a client kicks off a server-side traversal, and finally receives the overal result once it is assembled.

Conclusion: traversals on anything but very small graphs should be run server-side. A server-side action (see below) is a good way to do this. Please note that running a server-side traversal does not mean giving up flexibility and control flow functionality. Server-side traversals remain highly configurable through custom JavaScript functions that allow implementation of user-defined business logic.

AQL queries

We won’t have your application send a series of 100,000 individual insert statements to the relational database of our choice. We already know from the past that this is going to be rather slow, so we have learned to avoid this. In the relational context, we rather use SQL queries that create or modify many rows in one go, e.g. an INSERT INTO ... SELECT ..., bulk inserts etc.

ArangoDB is no different. In general, you should try to avoid issuing lots of individual queries to the database from a client application. Instead and if the queries look alike, try converting multiple individual operations into a single AQL query. This will already save a lot of network overhead.

AQL provides multi-document operations to insert, update, and remove data. An overview is given here.

The above 100K inserts from the contrived example can easily be transformed into this single AQL query:

inserting 100k documents
1
FOR i IN 1..100000 INSERT { value: i } INTO test

Bulk imports

For importing larger amounts of documents from files, there is the specialized arangoimp import tool. It can load data from JSON and CSV files into ArangoDB. The tool is shipped with ArangoDB.

ArangoDB also provides a REST API for bulk imports of documents.

Joins

A special note about joins: the fact that several NoSQL databases do not provide join functionality has driven some people to emulate join functionality on the client-side, in their applications.

This can be a recipe for disaster: client-side join implementation might lead to horrendous amounts of queries that might need to be sent to the database for fetching all the records. More than that, if data are queried individually, the overall result may lack consistency. By the way, the same is true for fetching referenced or linked documents.

ArangoDB provides join functionality via AQL queries. Additionally, AQL queries can be used to fetch other documents with the original documents. Note that ArangoDB has no way of defining references or links between documents, but still AQL allows combining arbitrary documents in one query.

In almost all cases it make more sense to use an AQL query that performs joins or reference-fetching server-side and close to the data than having to deal with that on the application-side of things.

AQL joins are described here.

Server-side actions

With stored procedures, relational databases provide another way for an application to trigger the execution of a large amount of queries. Stored procedures are executed server-side, too, so they allow avoiding a lot of request-response ping pong between the application and the database, at least for defined tasks. Additionally, stored procedures provide control flow functionality, which can also be handy when operations depend on each other.

Coming back to ArangoDB: complex data-processing tasks that need to execute multiple operations or need control flow functionality might benefit if converted from multiple application-side operations into a single server-side action.

Server-side actions run inside the ArangoDB server, closer to the data, and can be much faster than a series of client-side operations. A server-side action is called with just one HTTP request from the application, so it may lead to saving lots of request-response cycles and reduction in network overhead. Apart from that, server-side actions in ArangoDB can employ transactions and provide the necessary control over isolation and atomicity when executing a series of operations.

Business logic and control flow functionality can be integrated easily because server-side actions in ArangoDB are JavaScript functions, with all of the language’s programming features being available.

But there’s even more to it: a single server-side operation can be written to put together its result in a format most convenient for the client application. This can also lead to better encapsulation, because all an application needs to know about a server-side action is its API or contract. Any internals of the action can be hidden from the client application. Overall, this supports a service-oriented approach.

To learn more about how to write server-side actions, please have a look at ArangoDB’s Foxx. It is all about making server-side actions available via REST APIs.