
{ "subject" : "ArangoDB", "tags": [ "multi-model", "nosql", "database" ] }

How ArangoDB's Write-ahead Log Works

Since version 2.2, ArangoDB stores all data-modification operations in its write-ahead log (abbreviated WAL). The introduction of the WAL massively changed how data are stored in ArangoDB.

What’s in the WAL?

The WAL contains data of all data-modification operations that were executed in the ArangoDB server instance. Operations are written to the WAL in the order of execution. The following types of operations are logged to the WAL:

  • creating, updating, replacing or removing documents
  • creating, modifying or dropping collections and their indexes
  • creating or dropping databases
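
For illustration, here is a short arangosh session in which every command produces one or more WAL entries (the collection, index, attribute and database names are made up for this example):

// collection and index creation are logged to the WAL
db._create("test");
db.test.ensureSkiplist("value");

// document operations are logged to the WAL
db.test.save({ _key: "foo", value: 1 });
db.test.update("foo", { value: 2 });
db.test.remove("foo");

// database creation and removal are logged to the WAL
db._createDatabase("temp");
db._dropDatabase("temp");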

The WAL is used for all databases of an ArangoDB server. Database ids are stored in the WAL in order to tell data from different databases apart.

Recovery using the WAL

Should the ArangoDB server crash, it will replay its write-ahead logs at restart. Replaying the logs will make the server recover the same state of data as before the crash.

Any document-modification operation may belong to a transaction. Transaction data are also stored in the write-ahead log, allowing the recovery of committed transactions and preventing the recovery of aborted or unfinished transactions.
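
To sketch how such transactions can be produced from arangosh (the collection names and keys are invented and roughly mirror the example table below): the first transaction commits, so its operations would be recovered after a crash; the second one throws, is therefore aborted, and its operations would not be recovered:

// a transaction that commits: its document operations end up in the
// WAL as part of a committed transaction
// (assuming a document with key "baz" already exists in "test")
db._executeTransaction({
  collections: { write: [ "test" ] },
  action: function () {
    var db = require("org/arangodb").db;
    db.test.save({ _key: "foo" });
    db.test.remove("baz");
  }
});

// a transaction that throws: it is aborted, and its operations will
// not be recovered after a crash
db._executeTransaction({
  collections: { write: [ "mycollection" ] },
  action: function () {
    var db = require("org/arangodb").db;
    db.mycollection.save({ _key: "bar" });
    throw "giving up deliberately";
  }
});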

Let’s assume the following operations are executed in an ArangoDB server in this order…

Seq#  |  Operation type       |  Transaction#  |  Context
------+-----------------------+----------------+---------------------------------------
   1  |  start transaction    |           773  |  database "_system"
   2  |  insert document      |           773  |  collection "test", key "foo"
   3  |  start transaction    |           774  |  database "_system"
   4  |  insert document      |           774  |  collection "mycollection", key "bar"
   5  |  start transaction    |           775  |  database "_system"
   6  |  update document      |           775  |  collection "boom", key "test"
   7  |  abort transaction    |           774  |  -                
   8  |  remove document      |           773  |  collection "test", key "baz"
   9  |  commit transaction   |           773  |  -     

…and then the server goes down due to a power outage.

On server restart, the WAL contents will be replayed, so the server will redo the above operations. It will find out that operations #2 and #8 belong to transaction #773. Transaction #773 was already committed, so all of its operations must and will be recovered.

Furthermore, it will find that operation #4 belongs to transaction #774, which was aborted by the user. Therefore, this operation will not be replayed but ignored.

Finally, it will find that operation #6 belongs to transaction #775. For this transaction, there is neither an abort nor a commit operation in the log. Because the transaction was never committed, none of its operations are replayed at restart, and the server will behave as if the transaction never happened.
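
Conceptually, the recovery procedure can be thought of as a two-pass scan over the log. The following JavaScript sketch illustrates the idea only; it is not ArangoDB's actual recovery code, and applyOperation is a hypothetical helper:

// pass 1: determine the final state of every transaction found in the log
function collectTransactionStates (logEntries) {
  var states = { };   // transaction id -> "committed" or "aborted"
  logEntries.forEach(function (entry) {
    if (entry.type === "commit transaction") {
      states[entry.tid] = "committed";
    }
    else if (entry.type === "abort transaction") {
      states[entry.tid] = "aborted";
    }
  });
  return states;
}

// pass 2: replay only the data-modification operations that belong to
// committed transactions; aborted and unfinished transactions are skipped
function replay (logEntries) {
  var states = collectTransactionStates(logEntries);
  logEntries.forEach(function (entry) {
    if (entry.isDataModification && states[entry.tid] === "committed") {
      applyOperation(entry);   // hypothetical: redo the operation
    }
  });
}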

WAL and replication

A side-effect of having a write-ahead log is that it can also be used for replication. When a slave server fetches the latest changes from the master, the master can simply read the operations from its WAL. Data in the WAL are self-contained, meaning the master can efficiently compile the list of changes using only the WAL and without performing lookups elsewhere.

Because the WAL is always there and in use anyway, any ArangoDB server can be used as a replication master without extra configuration. Previous versions of ArangoDB (without the WAL) required setting up an extra component for replication logging. This requirement is now gone.
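
For example, a client connected via arangosh can inspect the master's WAL-based replication logger through the HTTP replication API (the endpoint paths below may differ between versions, so please check the documentation of the version used):

// current state of the master's replication logger (fed from the WAL)
arango.GET("/_api/replication/logger-state");

// changes recorded after a given tick value can then be fetched via
// /_api/replication/logger-follow?from=<tick>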

Organization of the WAL

The WAL is actually a collection of logfiles. Logfiles are named logfile-xxxx.db (with xxxx being the logfile’s id). Logfiles with lower ids are older than logfiles with higher ids. By default, the logfiles reside in the journals sub-directory of ArangoDB’s database directory.

At any point in time, one of the logfiles will be the active logfile. ArangoDB will write all data modifications to the active logfile. Writing is append-only, meaning ArangoDB will never overwrite existing logfile data. To ensure logfile integrity, a CRC32 checksum is calculated for each logfile entry. This checksum is validated when a logfile is replayed. A checksum mismatch indicates a disk error or an incompletely written operation; in either case it is not safe to recover and replay the operation.

If an operation can’t be written into the active logfile due to lack of space, the active logfile will be closed and a new logfile will become the active logfile.

A background thread will open new logfiles before the current active one is fully filled up. This is done to ensure that no waiting is required when there is a switch of the active logfile.

By default, each logfile has a size of 32 MB, allowing lots of operations to be stored in it. If you want to adjust the default size, the option --wal.logfile-size is for you.
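
The WAL configuration can also be inspected from arangosh at runtime, assuming the internal module's wal helper is available in the version used:

// inspect the current WAL configuration; the result contains, among
// other values, the configured logfile size (--wal.logfile-size)
var internal = require("internal");
internal.wal.properties();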

Logfile synchronization

Writes to logfiles are synchronized to disk automatically in a configurable interval (the option to look for is --wal.sync-interval). To get immediate synchronization, an operation can be run with its waitForSync attribute set to true, or on a collection that has the waitForSync attribute set.

For example, the following operations will have been synchronized to disk when the operations return:

// operation will be synchronized because the `waitForSync` attribute 
// is set on operation level
db.test.save({ foo: "bar" }, { waitForSync: true });

// operation will be synchronized because the `waitForSync` attribute 
// is set on collection level
db.mycollection.properties({ waitForSync: true });
db.mycollection.save({ foo: "bar" });

When no immediate synchronization has been requested, a background thread in ArangoDB will periodically call msync for not-yet-synchronized logfile regions. Multiple operations are synchronized together because they reside in adjacent memory regions. That means automatic synchronization can get away with far fewer calls to msync than there are operations.

Storage overhead

Documents stored in the WAL (as part of an insert, update or replace operation) are stored in a format that contains the document values plus the document’s shape. This allows reading a document fully from a WAL entry without looking up shape information elsewhere, making it faster and also more reliable.

Storing shape information in the WAL has a storage space overhead though. The overhead should not matter much if a logfile contains a lot of documents with identical shapes. ArangoDB will make sure each shape is only stored once per WAL logfile. This has turned out to be a rather good solution: it reduces WAL storage space requirements greatly, and still is reliable and fast, as shape lookups are local to the current WAL logfile only.

The overhead of storing shape information in the WAL will matter most when documents have completely different shapes. In this case, no shape information will ever be re-used. While this may happen in benchmarks with synthetic data, we found that in reality there are often lots of identically-structured documents and thus a lot of potential for re-using shapes.
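
To illustrate what "identical shape" means: the first two documents below share one shape (the same attribute names with the same value types), so that shape has to be stored only once per WAL logfile, whereas the third document introduces a new shape:

// two documents with the identical shape (attributes "name" and "age")
db.test.save({ name: "John", age: 42 });
db.test.save({ name: "Jane", age: 23 });

// a document with a different shape, requiring extra shape
// information to be stored in the WAL
db.test.save({ name: "Jim", hobbies: [ "skiing" ] });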

Note that storing shape information in the WAL can be turned off to reduce overhead. ArangoDB provides the option --wal.suppress-shape-information for this purpose. When set to true, no shape information will be written to the WAL. By default, the option is set to false, and it shouldn't be changed if the server is to be used as a replication master. If documents aren't too heterogeneous, setting the option to true won't help much. It will help a lot if all documents that are stored have different shapes (which we consider unrealistic, but we still provide the option to reduce overhead in this case).

WAL cleanup

WAL logfiles that are completely filled are subject to garbage collection. WAL garbage collection is performed by a separate garbage collector thread. The thread will copy over the still-relevant operations into the collection datafiles. After that, indexes will be adjusted to point to the new storage locations. Documents that have become obsolete due to later changes will not be copied from the WAL into the collection datafiles at all.
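
There is normally no need to trigger this manually, but for testing it can be handy to flush the WAL and wait for the collector from arangosh (assuming the internal module's wal.flush helper is available in the version used):

// write out all pending WAL data and wait until the garbage collector
// has transferred it into the collection datafiles
require("internal").wal.flush(true, true);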

Garbage-collected logfiles are deleted by ArangoDB automatically if there exist more of these “historic” logfiles than configured. The number of historic logfiles to keep before deletion is configured using the option --wal.historic-logfiles.

If no replication is to be used, there is no need to keep any historic logfiles. They have no purpose but to provide a history of recent changes. The more history there is on a master server, the longer the period for which slave servers can request changes. How much history is needed depends on how reliable the network connection between a replication slave and the master is. If the network connection is known to fail periodically, it may be wise to keep a few historic logfiles on the master, so the slave can catch up from the point where it stopped once the network connection is re-established.

If network connections are reliable or no replication is to be used at all, the number of historic logfiles can be set to a low value to save disk space.
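
Assuming the number of historic logfiles is also adjustable at runtime via the WAL properties helper (otherwise, --wal.historic-logfiles can be set at startup), this could look like:

// keep only a couple of historic logfiles to save disk space
require("internal").wal.properties({ historicLogfiles: 2 });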

Side-effects of the WAL

The WAL can be used for replication, removing the requirement to explicitly turn on the separate logging of operations for replication purposes. This is a clear improvement over previous versions of ArangoDB.

The introduction of the WAL also caused a few other minor changes:

While documents are stored in a WAL logfile, their sizes won't be included in the output of the figures method of the collection. When a WAL logfile gets garbage-collected, documents will physically be moved into the collection datafiles and the figures will be updated.

Note that the output of the count method is not affected by whether a document is stored in the WAL or in a collection logfile.
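
This can be observed in arangosh, again using the made-up collection "test":

db.test.save({ value: 1 });

// count() immediately reflects the newly inserted document
db.test.count();

// figures() may not yet account for the document's size while the
// document still resides in the WAL; the numbers will be updated once
// the WAL garbage collection has moved it into the collection datafiles
db.test.figures();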

Another side-effect of storing operations in the WAL first is that no collection datafiles will be created when the first document is inserted. So there will be collections with documents but without any datafiles, at least temporarily, until the WAL garbage collection kicks in and transfers the data from the WAL to the collection datafiles.