J@ArangoDB

{ "subject" : "ArangoDB", "tags": [ "multi-model", "nosql", "database" ] }

Schema Handling in ArangoDB

Note: this post is about the ArangoDB 2.x series

Schemas vs. schema-free

In a relational database, all rows in a table have the same structure. The structure is saved once for the table, and the individual rows only contain the row’s values. This is an efficient approach if all records have the exact same structure, i.e. the same attributes (same names and same data types).

Example records
firstName (varchar)  |  lastName (varchar)  |  status (varchar)
---------------------+----------------------+------------------
"fred"               |  "foxx"              |  "active"
"john"               |  "doe"               |  "inactive"

This is not a good fit if the data structure changes. In this case, an ALTER TABLE command would need to be issued in the relational database, converting all existing rows into the new structure. This is an expensive operation because it normally requires rewriting all existing rows.

The situation becomes really difficult when there is no definite structure for a table – if rows shall have a dynamic or variable structure, then it can be quite hard to define a sensible relational table schema!

This is where NoSQL databases enter the game – most of them don’t require defining a schema for a “table” at all. Instead, each individual record contains not only its data values, but also its own schema. This means much higher flexibility, as every record can have its own, completely different data structure.

This flexibility has a disadvantage though: storing schemas in individual records requires more storage space than storing the schema only once for the complete table. This is especially true if most (or even all) records in the table do have the same structure. A lot of storage space can be wasted while storing the same structure information again and again and again…

Schemas in ArangoDB

ArangoDB tries to be different in this respect: on the one hand it is a schema-free database and thus allows flexible storage. All documents in a collection (the ArangoDB terms for records and a table) can have the same or totally different structures. We leave this choice up to the user.

On the other hand, ArangoDB will exploit the similarities in document structures to save storage space. It will detect identical document schemas, and only save each unique schema once. This process is called shaping in ArangoDB.

Shaping

We optimized ArangoDB for this use case because we found that in reality, the documents in a collection will either have absolutely the same schema, or there will only be a few different schemas in use.

From the user perspective there are no schemas in ArangoDB: there is no way to create or alter the schema of a collection at all. Instead, ArangoDB will use the attribute names and data types contained in the JSON data of each document. All of this happens automatically.

For each new document in a collection, ArangoDB will first figure out the schema. It will then check if it has already processed a document with the same schema. If yes, then there is no need to save the schema information again. Instead, the new document will only contain a pointer to an already existing schema. This does not require much storage space.

If ArangoDB figures out that it has not yet processed a document with the same schema, it will store the document schema once, and store a pointer to the schema in the new document. This is a slightly more expensive operation, but it pays off when there are multiple documents in a collection with the same structure.

When ArangoDB looks at document schemas, it takes into account the attribute names and the attribute value data types contained in a document. All attribute names and data types in a document make the so-called shape.

Each shape is only stored once for each collection. Any attribute name used in a collection is also stored only once, and then reused from any shape that contains the attribute name.
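The lookup-or-store procedure described above can be sketched in a few lines of Python. This is only an illustration of the idea, not ArangoDB's actual implementation: the names shape_of, values_of and Collection are hypothetical, the type names are simplified, and the sketch assumes that the order of attributes does not matter when comparing shapes.

```python
def shape_of(value):
    """Describe a value's structure: attribute names and data types only."""
    if isinstance(value, dict):
        # Assumption: attribute order is ignored, so names are sorted.
        return ("object", tuple((k, shape_of(v)) for k, v in sorted(value.items())))
    if isinstance(value, list):
        return ("list", tuple(shape_of(v) for v in value))
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if value is None:
        return "null"
    return "string"


def values_of(value):
    """Extract only the data values, in the same order the shape uses."""
    if isinstance(value, dict):
        return [values_of(v) for _, v in sorted(value.items())]
    if isinstance(value, list):
        return [values_of(v) for v in value]
    return value


class Collection:
    """Hypothetical in-memory collection that stores each shape only once."""

    def __init__(self):
        self.shape_ids = {}  # shape -> shape id, one entry per unique shape
        self.documents = []  # (shape id, values) pairs

    def insert(self, document):
        shape = shape_of(document)
        shape_id = self.shape_ids.get(shape)
        if shape_id is None:
            # First document with this structure: store the shape once.
            shape_id = len(self.shape_ids)
            self.shape_ids[shape] = shape_id
        # The document itself keeps only a shape reference plus its values.
        self.documents.append((shape_id, values_of(document)))
        return shape_id
```

Inserting a second document with the same attribute names and types returns the same shape id, and only the values are stored for it.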

Examples

The following documents have different values, but their schemas are identical:

{ "name" : { "first" : "fred", "last" : "foxx" }, "status" : "active" }
{ "name" : { "first" : "john", "last" : "doe" }, "status" : "inactive" }

Both documents contain attributes named name and status. name is an object with two sub-attributes, first and last, which are both strings. status also has a string value in both documents.

ArangoDB will save this schema only once in a so-called shape. The documents will store their own data values plus a pointer to this (same) shape.

The next two documents have different schemas that ArangoDB has not seen before. ArangoDB will therefore store these two schemas in two new shapes:

{ "firstName" : "jack", "lastName" : "black", "status" : "inactive" }
{ "name" : "username", "status" : "unknown" }

We would end up with three different shapes for the four documents. This might not sound impressive, but if more documents are saved with one of the existing shapes, then storing each shape just once might really pay off.
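To check the count of three shapes for the four example documents, one can group the documents by a structural signature. The following Python snippet is a rough sketch under stated assumptions: the signature function is hypothetical, it uses Python type names instead of ArangoDB's internal type information, and it treats attribute order as irrelevant.

```python
from collections import Counter


def signature(value, prefix=""):
    """Flatten a document into sorted (attribute path, type name) pairs."""
    if isinstance(value, dict):
        pairs = []
        for key, sub in value.items():
            pairs.extend(signature(sub, prefix + key + "."))
        return tuple(sorted(pairs))
    # Leaf value: record its path and the name of its Python type.
    return ((prefix.rstrip("."), type(value).__name__),)


documents = [
    {"name": {"first": "fred", "last": "foxx"}, "status": "active"},
    {"name": {"first": "john", "last": "doe"}, "status": "inactive"},
    {"firstName": "jack", "lastName": "black", "status": "inactive"},
    {"name": "username", "status": "unknown"},
]

# Count how many documents share each distinct signature.
shape_counts = Counter(signature(doc) for doc in documents)
for shape, count in shape_counts.items():
    print(count, shape)
```

The first two documents fall into the same group, while the other two each form a group of their own.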

A note on attribute names

Even though the latter two example documents had unique schemas, we saw in the examples that attribute names were already repeating. For example, all documents shown so far had an attribute named status, and some also had a name attribute.

ArangoDB figures out when attribute names repeat, and it will not store the same attribute name more than once in a collection. Given that many documents in a collection use a fixed set of repeating attribute names, this approach can lead to considerable storage space reductions.

As an aside, reusing attribute name information allows using descriptive (read: long) attribute names in ArangoDB with very low storage overhead.

For example, in ArangoDB it will not cost much extra space to use long attribute names like these in lots of documents:

{ "firstNameOfTheUser" : "jack", "lastNameOfTheUser" : "black" }

Each unique attribute name is only stored once per collection. In ArangoDB there is thus no need to artificially shorten the attribute names in the data, as is sometimes done in other schema-free databases to save storage space:

{ "fn" : "jack", "ln" : "black" }

This artificial crippling of the attribute names makes the meaning of the attributes quite unclear and should be avoided. As mentioned, it is not necessary to do this in ArangoDB, as it saves attribute names separately from attribute values, and repeating attribute names are not stored repeatedly.
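A back-of-the-envelope calculation illustrates why long names cost little when each unique name is stored only once. The Python sketch below is a simplification with hypothetical helper names: it counts only the bytes of the attribute names themselves and ignores the per-document shape reference and any other bookkeeping a real storage engine needs.

```python
def name_bytes_repeated(documents):
    """Bytes spent on attribute names if every document carries its own names."""
    return sum(len(name.encode("utf-8")) for doc in documents for name in doc)


def name_bytes_stored_once(documents):
    """Bytes spent on attribute names if each unique name is stored only once."""
    unique_names = {name for doc in documents for name in doc}
    return sum(len(name.encode("utf-8")) for name in unique_names)


# 1000 documents with descriptive names, and 1000 with shortened names.
long_docs = [{"firstNameOfTheUser": "jack", "lastNameOfTheUser": "black"}] * 1000
short_docs = [{"fn": "jack", "ln": "black"}] * 1000

for label, docs in (("long", long_docs), ("short", short_docs)):
    print(label, name_bytes_repeated(docs), name_bytes_stored_once(docs))
```

With the names repeated in every document, the long variant needs many times more space for names than the short one. With each name stored once, the difference between the two variants shrinks to a few bytes for the whole collection.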