J@ArangoDB

{ "subject" : "ArangoDB", "tags": [ "multi-model", "nosql", "database" ] }

How V8 Is Used in ArangoDB

ArangoDB allows running user-defined JavaScript code in the database. This can be used for more complex, stored-procedure-like database operations. Additionally, ArangoDB’s Foxx framework can be used to make any database functionality available via an HTTP REST API. It’s easy to build microservices with it, using the scripting functionality for tasks like access control, data validation and sanitization.

We often get asked how the scripting functionality is implemented under the hood. Additionally, several people have asked how ArangoDB’s JavaScript functionality relates to node.js.

This post tries to explain that in detail.

The C++ parts

arangosh, the ArangoShell, and arangod, the database server, are written in C++ and they are shipped as native code executables. Some parts of both arangosh and arangod itself are written in JavaScript (more on that later).

The I/O handling in arangod is written in C++ and uses libev (written in C) for the low-level event handling. All the socket I/O, work scheduling and queueing is written in C++, too. These are parts that require high parallelism, so we want them to run in multiple threads.

All the indexes, the persistence layer and many of the fundamental operations, like the ones for document inserts, updates, deletes and imports, are written in C++ for effective control of memory usage and parallelism. AQL’s query parser is written using the usual combination of Flex and Bison, which generate C files that are compiled to native code. The AQL optimizer, the AQL executor and many AQL functions are written in C++ as well.

Some AQL functions, however, are written in JavaScript. And if an AQL query invokes a user-defined function, this function will be a JavaScript function, too.
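As a rough illustration, a user-defined AQL function is just an ordinary JavaScript function. The function below and its name are made up for this example. The registration call shown in the comment assumes the `register` function of the `@arangodb/aql/functions` module found in recent ArangoDB versions, so it should be checked against the documentation of the version in use.

```javascript
// Hypothetical user-defined function: converts degrees Celsius to Fahrenheit.
// It is plain JavaScript and does not depend on any server internals.
function celsiusToFahrenheit(celsius) {
  return celsius * 9 / 5 + 32;
}

// Inside arangod or arangosh, the function could then be registered under a
// namespaced name so that AQL queries can call it (assumption: module path
// and function name as in recent ArangoDB versions):
//
//   require("@arangodb/aql/functions").register(
//     "MYFUNCTIONS::TEMPERATURE::CELSIUSTOFAHRENHEIT", celsiusToFahrenheit);

console.log(celsiusToFahrenheit(100)); // prints 212
```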

How ArangoDB uses V8

How is JavaScript code executed in ArangoDB?

Both arangosh and arangod are linked against the V8 JavaScript engine library. V8 (itself written in C++) is the component that runs the JavaScript code in ArangoDB.

V8 requires JavaScript code to run in a so-called isolate (note: I’ll be oversimplifying a bit here; in reality there are isolates and contexts). As the name suggests, isolates are completely isolated from each other. In particular, data cannot be shared or moved across isolates, and each isolate can be used by only one thread at a time.

Let’s look at how arangosh, the ArangoShell, uses V8. All JavaScript commands entered in arangosh are compiled and executed with V8 immediately. In arangosh, this happens using a single V8 isolate.

On the server side, things are a bit different. In arangod, there are multiple V8 isolates. The number of isolates to create is a startup configuration option (--javascript.v8-contexts). Creating multiple isolates allows running JavaScript code in multiple threads, truly in parallel. Apart from that, arangod has multiple I/O threads (--scheduler.threads configuration option) for handling the communication with client applications.

As mentioned earlier, part of ArangoDB’s codebase itself is written in JavaScript, and this JavaScript code is executed the same way as any user-defined code.

Executing JavaScript code with V8

For executing any JavaScript code (built-in or user-defined), ArangoDB will invoke V8’s JIT compiler to compile the script code into native code and run it.

The JIT compiler in V8 will not try extremely hard to optimize the code on the first invocation. On initial compilation, it will aim for a good balance of optimizations and fast compilation time. If it finds that some code parts are called often, it may automatically try again to optimize these parts more aggressively. To make things even more complex, there are different JIT compilers in V8 (e.g. Crankshaft and Turbofan) with different sweet spots. JavaScript modes (e.g. strict mode and strong mode) can also affect the level of optimizations the compilers will carry out.

Now, after the JavaScript code has been compiled to native code, V8 will run it until it returns or fails with an uncaught exception.

But how can the JavaScript code access the database data and server internals? In other words, what actually happens if a JavaScript command such as the following is executed?

example JavaScript command
db.myCollection.save({ _key: "test" });

Accessing server internals from JavaScript

Inside arangod, each V8 isolate is equipped with a global variable named db. This JavaScript variable is a wrapper around database functionality written in C++. When the db object is created, we tell V8 that its methods are C++ callbacks.

Whenever the db object is accessed in JavaScript, the V8 engine will therefore call C++ methods. These provide full access to the server internals, can do whatever is required and return data in the format that V8 requires. V8 then makes the return data accessible to the JavaScript code.

Executing db.myCollection.save(...) is effectively two operations: accessing the property myCollection on the object db and then calling the function save on that property. For the first operation, V8 will invoke the object’s NamedPropertyHandler, which is a C++ function that is responsible for returning the value for the property with the given name (myCollection). In the case of db, we have a C++ function that returns the collection object if it exists, or undefined if not.

The collection object again has C++ bindings in the background, so calling function save on it will call another C++ function. The collection object also has a (hidden) pointer to the C++ collection. When save is called, we will extract that pointer from the this object so we know which C++ data structures to work on. The save function will also get the to-be-inserted document data as its payload. V8 will pass this to the C++ function as well so we can validate it and convert it into our internal data format.
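The two steps can be imitated in plain JavaScript. The sketch below is only an analogy and not ArangoDB’s actual binding code: a `Proxy` get trap stands in for the C++ named property handler, a `Map` stands in for the C++ data structures, and a closure replaces the hidden pointer. All names (`nativeCollections`, `makeCollectionWrapper`, `count`) and the simplified return value are assumptions made for this example.

```javascript
// Stand-in for the C++ side: a map from collection names to their documents.
// In arangod these would be native data structures, not JavaScript objects.
const nativeCollections = new Map([["myCollection", new Map()]]);

// Stand-in for the C++ save callback. The real one extracts a hidden pointer
// from `this`; here the wrapper simply keeps a reference in a closure.
function makeCollectionWrapper(name, storage) {
  return {
    save(doc) {
      // Validate the payload before touching the "internal" storage.
      if (doc === null || typeof doc !== "object" || typeof doc._key !== "string") {
        throw new TypeError("document must be an object with a string _key");
      }
      if (storage.has(doc._key)) {
        throw new Error("unique constraint violated for _key " + doc._key);
      }
      storage.set(doc._key, { ...doc }); // copy into the "internal format"
      return { _id: name + "/" + doc._key, _key: doc._key };
    },
    count() {
      return storage.size;
    }
  };
}

// The get trap plays the role of the named property handler: it is called
// for every property access on db and returns a collection or undefined.
const db = new Proxy({}, {
  get(target, property) {
    if (typeof property !== "string") {
      return undefined;
    }
    const storage = nativeCollections.get(property);
    return storage ? makeCollectionWrapper(property, storage) : undefined;
  }
});

const result = db.myCollection.save({ _key: "test" });
console.log(result._id);      // prints myCollection/test
console.log(db.doesNotExist); // prints undefined
```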

On the server side, there are several objects exposed to JavaScript that have C++ bindings. There are also non-object functions that have C++ bindings. Some of these functions are also bolted onto regular JavaScript objects.

Accessing server internals from ArangoShell

When running the same command in arangosh, things are completely different. The ArangoShell may run on the same host as the arangod server process, but it may also run on a completely different one. Providing arangosh with access to server internals such as pointers will therefore not work in general. Even if arangosh and arangod do run on the same host, they are independent processes with no access to each other’s data. The latter problem could be solved by having a shared memory segment that both arangosh and arangod can use, but that special case would not help in the general case, in which the shell can be located on any host.

To make the shell work in all these situations, it uses the HTTP REST API provided by the ArangoDB server to talk to it. For arangod, any ArangoShell client is just another client, with no special treatment or protocols.

As a consequence, all operations on databases and collections run from the ArangoShell are JavaScript wrappers that call their respective server-side HTTP APIs.

Recalling the command example again (db.myCollection.save(...)), the shell will first access the property myCollection of the object db. In the shell, db is a regular JavaScript object with no C++ bindings. When the shell is started, it will make an HTTP call to arangod to retrieve a list of all available collections, and register them as properties in its db object. Calling the save method on one of these objects will trigger an HTTP POST request to the server API at /_api/document?collection=myCollection, with the to-be-inserted data in its request body. Eventually the server will respond and the command will return with the data retrieved from the server.
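The shell-side wrapper can be sketched roughly like this. This is not the shell’s real implementation: `sendRequest` is a hypothetical transport that only records requests and returns a canned response instead of talking to a server, and `buildShellDb` is an invented helper that receives the collection names directly rather than fetching them over HTTP. Only the document URL is taken from the description above.

```javascript
// Log of the requests the shell "sent". A real shell would use an HTTP client.
const sentRequests = [];

// Hypothetical transport: records the request and returns a canned response.
function sendRequest(method, url, body) {
  sentRequests.push({ method: method, url: url, body: body });
  return { statusCode: 201, body: { _key: body ? body._key : undefined } };
}

// Build a plain db object from a list of collection names. The shell would
// fetch this list from the server at startup; here it is passed in directly.
function buildShellDb(collectionNames) {
  const db = {};
  for (const name of collectionNames) {
    db[name] = {
      // save is an ordinary JavaScript function wrapping one HTTP POST.
      save(doc) {
        const url = "/_api/document?collection=" + encodeURIComponent(name);
        return sendRequest("POST", url, doc).body;
      }
    };
  }
  return db;
}

const db = buildShellDb(["myCollection"]);
const saved = db.myCollection.save({ _key: "test" });
console.log(sentRequests[0].method, sentRequests[0].url);
```

Swapping `sendRequest` for a real HTTP client would be the only change needed to turn the sketch into a working, if minimal, client.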

Considerations

Consider running the following JavaScript code:

code to insert 1000 documents
for (var i = 0; i < 1000; ++i) {
  db.myCollection.save({ _key: "test" + i });
}

When run from inside the ArangoShell, the code will be executed in there. The shell will perform an HTTP request to arangod for each call to save. We’ll end up with 1,000 HTTP requests.

Running the same code inside arangod will trigger no HTTP requests, as the server-side functions are backed with C++ internals and can access the database data directly. It will be a lot faster to run this loop on the server than in arangosh. A while ago I wrote another article about this.

When replacing the ArangoShell with another client application, things are no different. A client application will not have access to the server internals, so all it can do is to make requests to the server (by the way, the principle would be no different if we used MySQL or other database servers, only the protocols would vary).

Fortunately, there is a fix for this: making the code run server-side. For example, the above code can be put into a Foxx route. This way it is not only fast but will be made accessible via an HTTP REST API so client applications can call it with a single HTTP request.
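The difference in round trips can be made concrete with a small simulation. Nothing here talks to a real server: `httpCall` merely counts invocations, and `serverInsertMany` is a hypothetical stand-in for a server-side route that runs the whole loop inside arangod.

```javascript
// Counts simulated HTTP round trips between a client and arangod.
let roundTrips = 0;
const stored = new Set();

// Server-side insert: direct access to the data, no HTTP involved.
function serverSave(doc) {
  stored.add(doc._key);
}

// Hypothetical server-side route that runs the whole loop inside the server.
function serverInsertMany(count) {
  for (let i = 0; i < count; ++i) {
    serverSave({ _key: "test" + i });
  }
  return { inserted: count };
}

// Simulated HTTP call: every invocation costs exactly one round trip.
function httpCall(handler, payload) {
  ++roundTrips;
  return handler(payload);
}

// Variant 1: the loop runs in the client, one request per document.
for (let i = 0; i < 1000; ++i) {
  httpCall(serverSave, { _key: "test" + i });
}
const clientSideTrips = roundTrips;

// Variant 2: the loop runs on the server, triggered by a single request.
roundTrips = 0;
stored.clear();
httpCall(serverInsertMany, 1000);
const serverSideTrips = roundTrips;

console.log(clientSideTrips, serverSideTrips); // prints 1000 1
```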

In reality, database operations will be more complex than in the above example. And this is where having a full-featured scripting language like JavaScript helps. It provides all the features that are needed for more complex tasks such as validating and sanitizing input data, access control, executing database queries and postprocessing results.

The differences to node.js

To start with: ArangoDB is not node.js, and vice versa. ArangoDB is not a node.js module either. ArangoDB and node.js are completely independent.

But there is a commonality: both ArangoDB and node.js use the V8 engine for running JavaScript code.

Threading

AFAIK, standard node.js only has a single V8 isolate to run all code in. While that made the implementation easier (no hassle with multi-threading) it also limits node.js to using only a single CPU.

It’s not unusual to see a multi-core server with a node.js instance maxing out one CPU while the other CPUs are sitting idle. In order to max out a multi-core server, people often start multiple node.js instances on a single server. That will work fine, but the node.js instances will be independent, and sharing data between them is not possible in plain JavaScript.

And because a node.js instance is single-threaded, it is also important that code written for node.js is non-blocking. Code that blocks while waiting for some I/O operation would block the only available CPU. Using non-blocking I/O operations allows node.js to queue the operation, and execute other code in the meantime, allowing overall progress. This also makes it look like it would be executing multiple actions in parallel, while it is actually executing them sequentially.
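A minimal sketch of this behavior, runnable in node.js: `readAsync` is a hypothetical non-blocking operation, simulated here with `setTimeout` rather than real I/O. The caller continues immediately, and the callback only runs after the current code has finished.

```javascript
// Records the order in which the steps actually execute.
const order = [];

// Hypothetical non-blocking "read": the result is delivered later via a
// callback, so the caller does not wait for it.
function readAsync(callback) {
  setTimeout(function () {
    callback("data");
  }, 0);
}

order.push("request read");
readAsync(function (data) {
  order.push("got " + data);
  // prints request read -> do other work -> got data
  console.log(order.join(" -> "));
});
order.push("do other work");
```

Even with a delay of zero, the callback is queued and runs only after the surrounding code has completed, which is why everything still executes sequentially on a single thread.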

In contrast, arangod is a multi-threaded server. It can serve multiple requests in parallel, using multiple CPUs. Because arangod has multiple V8 isolates that each can execute JavaScript code, it can run JavaScript in multiple threads in parallel.

arangosh, the ArangoShell, is single-threaded and provides only a single V8 isolate.

Usage of modules

Both node.js and ArangoDB can load code at runtime so it can be organized into modules or libraries. In both, extra JavaScript modules can be loaded using the require function.

There is often confusion about whether node.js modules can be used in ArangoDB. This is probably because the answer is “it depends!”.

node.js packages can be written in JavaScript, but they can also contain C++ code that is compiled to native code. The latter can be used to extend the functionality of node.js with features that JavaScript alone wouldn’t be capable of. Such modules, however, often depend heavily on a specific V8 version (so they do not necessarily compile in a node.js version with a different version of V8) and often rely on node.js internals.

ArangoDB can load modules that are written in pure JavaScript. Modules that depend on non-JavaScript functionality (such as native modules for node.js) or modules that rely on node.js internals cannot be loaded in ArangoDB. As a rule of thumb, any module that is implemented in pure JavaScript, does not access global variables and only requires other modules that obey the same restrictions will run in ArangoDB.
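For illustration, a module that follows this rule of thumb could look like the sketch below. The module and its function name are invented for this example. It uses only core JavaScript, touches no global variables and requires nothing else.

```javascript
// Hypothetical module "slug.js": pure JavaScript, no native add-ons, no
// node.js-specific globals such as process or Buffer, no further requires.
function slugify(text) {
  return String(text)
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse everything else into dashes
    .replace(/^-+|-+$/g, "");    // trim leading and trailing dashes
}

// Export in CommonJS style so that require() can load the module.
if (typeof module !== "undefined" && module.exports) {
  module.exports = { slugify: slugify };
}
```

Assuming the file is saved as slug.js, `require("./slug").slugify("How V8 Is Used in ArangoDB")` would return "how-v8-is-used-in-arangodb".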

ArangoDB also uses several externally maintained JavaScript-only libraries, such as underscore.js. This module will run everywhere because it conforms to the mentioned restrictions.

ArangoDB also uses several other modules that are maintained on npm. An example module is AQB, a query builder for AQL. It is written in pure JavaScript too, so it can be used from a node.js application and from within ArangoDB. If there is an updated version of this module, we use npm to install it in a subdirectory of ArangoDB. As per npm convention, the node.js modules shipped with ArangoDB reside in a directory named node_modules. This is probably what caused some of the confusion.