To JSON or not?

We’re often asked why ArangoDB uses JSON as its data-interchange format for transferring documents from clients to the database server and back. This is often accompanied by the question of whether we could use <insert fancy format here> instead.

In the following article, I’ll try to outline the reasons why we picked JSON as the interchange format, and why we still use it.

I’ll start with a discussion of the pros and cons of JSON, look at some alternative formats and present my personal opinion on why using a different format may not provide too much benefit for us, at least at the moment.

This post does not intend to say that any of these formats are better or worse in general. I think there are applications for all of them.

However, I wanted to look at the different formats with our specific use case, i.e. a RESTful database, in mind.

What I don’t like about JSON

JSON is often criticized for its inefficiency and lack of real data types. I often criticize it myself.

Following are my personal top 3 pain points.

Parsing and memory allocation

I have to admit that parsing JSON is painful from the efficiency perspective. When the JSON parser encounters a { token, it will know this is the start of an object, but it has no idea how many object members will follow and need to be stored with the object. The same is true for lists (starting with [).

String values are no different: when the parser encounters a ", the length of the string is still unknown. To determine the length of the string, one must read until the end of the string, taking into account escape sequences for special characters, e.g. \/, \n, \t, \\, but also Unicode escape sequences.

For example, the escaped 36-byte string In K\u00f6ln, it's 15 \u00b0 outside will be parsed into the 28-byte UTF-8 string In Köln, it's 15 ° outside.
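The byte counts in this example can be checked with a few lines of Python. This is only a quick sketch using the standard json module; the variable names are mine.

```python
import json

# The JSON text as it would appear on the wire (escape sequences still encoded).
escaped = r'''"In K\u00f6ln, it's 15 \u00b0 outside"'''

# Parsing resolves the \uXXXX sequences into real characters.
decoded = json.loads(escaped)

# Length of the escaped payload, excluding the two surrounding quotes.
escaped_len = len(escaped.encode("ascii")) - 2
# Length of the decoded string once it is stored as UTF-8.
utf8_len = len(decoded.encode("utf-8"))

print(escaped_len, utf8_len)  # 36 28
```

The parser only learns the final length after it has walked the whole string, which is exactly the allocation problem described here.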

With the overall size of objects, lists or strings unknown at the start of a token, it’s hard to reserve the right amount of memory. Instead, memory either needs to be allocated on the fly as JSON tokens are parsed, or one or more (potentially too big) chunks of memory need to be put aside at the start of parsing. The parser can then use this already allocated memory to store whatever is found afterwards.

Verbosity

JSON data can also become very fluffy. I already mentioned that serializing strings to JSON might incur some overhead due to escape sequences.

But there are more things like this: each boolean value requires 4 (true) or 5 (false) bytes. Recurring object member names need to be stored again for every object, as JSON does not provide string interning or similar mechanisms.
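To get a feeling for the cost of repeated member names, here is a small Python sketch. The documents and their member names (firstName, lastName, active) are made up for illustration; real data will of course behave differently.

```python
import json

# 1,000 hypothetical documents that all share the same member names.
docs = [{"firstName": "A", "lastName": "B", "active": True} for _ in range(1000)]

# Most compact JSON encoding: no whitespace between tokens.
encoded = json.dumps(docs, separators=(",", ":")).encode("utf-8")

# Bytes spent on the quoted member names alone, repeated in every document.
names = ["firstName", "lastName", "active"]
name_bytes_per_doc = sum(len(json.dumps(n)) for n in names)
name_bytes_total = name_bytes_per_doc * len(docs)

print(len(encoded), name_bytes_total)
```

With values this short, the member names account for more than half of the encoded size. Longer values shrink that share, but the names are still repeated in every single document.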

Data types

Apart from that, the JSON type system is limited. There is only one type to represent numbers. Different types for representing numbers of different value ranges are (intentionally) missing. For example, one might miss 64-bit integer data types or arbitrary-precision numbers. A date type (for calendar dates and times) is often missed, too.
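The missing integer types matter in practice because many consumers, JavaScript among them, map every JSON number to an IEEE 754 double. The following Python sketch simulates such a consumer with the parse_int hook of json.loads; the chosen value is just an example that no longer fits into a double exactly.

```python
import json

big = 2**53 + 1  # 9007199254740993, well within the range of a signed 64-bit integer

text = json.dumps(big)

# Python's own parser keeps integers exact...
exact = json.loads(text)
# ...but a consumer that turns every JSON number into a double
# silently rounds to the nearest representable value.
as_double = json.loads(text, parse_int=float)

print(exact)           # 9007199254740993
print(int(as_double))  # 9007199254740992
```

Nothing in the JSON text itself tells the receiver which of the two interpretations the sender had in mind.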

And yes, binary data cannot be represented in JSON without converting it into a JSON string first. This may require base64-encoding or something similar.
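As a rough illustration of what that conversion costs, here is a Python sketch that wraps random bytes into a JSON document via base64. The member name "data" is an arbitrary choice for this example.

```python
import base64
import json
import os

payload = os.urandom(3000)  # stand-in for arbitrary binary data

# json.dumps(payload) would raise a TypeError: bytes are not JSON serializable.
wrapped = json.dumps({"data": base64.b64encode(payload).decode("ascii")})

# Base64 turns every 3 input bytes into 4 output characters.
encoded_len = len(json.loads(wrapped)["data"])
print(len(payload), encoded_len)  # 3000 4000

# The receiver has to know that this particular string must be decoded again.
restored = base64.b64decode(json.loads(wrapped)["data"])
assert restored == payload
```

So the payload grows by a third, and both sides need to agree out of band on which strings are really binary data.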

In general, the available data types in JSON are very limited, and the format by itself is not extensible. Extending JSON with your own type information will either create ill-formed JSON (read: non-portable) or introduce special-meaning members that other programs and tools won’t understand (read: non-portable).
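The second option, special-meaning members, could look like the following Python sketch. The "$date" wrapper, the helper names and the example date are assumptions made up for this illustration, not something JSON defines.

```python
import json
from datetime import datetime, timezone

def encode(value):
    # Hypothetical convention: wrap datetimes in a single-member object.
    if isinstance(value, datetime):
        return {"$date": value.isoformat()}
    raise TypeError(f"cannot serialize {type(value).__name__}")

def decode(obj):
    # Only code that knows the convention turns the wrapper back into a datetime.
    if set(obj) == {"$date"}:
        return datetime.fromisoformat(obj["$date"])
    return obj

created = datetime(2012, 11, 5, 12, 0, tzinfo=timezone.utc)  # arbitrary example date
text = json.dumps({"created": created}, default=encode)

aware = json.loads(text, object_hook=decode)  # understands "$date"
unaware = json.loads(text)                    # sees a plain nested object
```

The document stays valid JSON, but any program that does not know the convention just sees an object with an oddly named member.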

Why still use JSON?

So what are the reasons to still stick with JSON? From my point of view, there are still a few good reasons to do so:

Simplicity

The JSON specification fits on five pages (including images). It is simple and intuitive.

Additionally, JSON-encoded data is instantly comprehensible. There is simply no need to look up the meanings of binary magic values in format specifications. It is also very easy to spot errors in ill-formed JSON.

In my eyes, looking at JSON data during a debugging session is much easier than looking at binary data (and I do look at binary data sometimes).

Flexibility

JSON requires no schema to be defined for data. This is good, as it allows you to get something done sooner. Schemas also tend to change over time, and this can become a problem with other formats that have schemas. With schema-less JSON, a schema change becomes a no-brainer: just change the data inside the JSON and you’re done. No need to maintain a separate schema.

The schema-relaxed approach of JSON also plays quite well with languages that are loosely typed or allow runtime modifications of data structures. Most scripting languages are in this category.

Language support

JSON is supported in almost every environment. Support for JSON is sometimes built into languages directly (JavaScript) or the languages come with built-in serialization and deserialization functions (e.g. PHP). Just go and use it.

For any other language without built-in support for JSON, it won’t be hard to find a robust implementation for JSON serialization/deserialization.

In the ArangoDB server, we use a lot of JavaScript code ourselves. Users can also extend the server functionality with JavaScript. Guess what happens when a JSON request is sent to the server and its payload is handed to a JavaScript-based action handler in the server? Yes, we’ll take the request body and create JavaScript objects from it. This is as simple as it can be, because we have native JSON support in JavaScript, our server-side programming language.

We also encourage users to use ArangoDB as a back end for their JavaScript-based front ends. Especially when running in a browser, using JSON as the interchange format inside AJAX* requests makes sense. You don’t want to load serialization/deserialization libraries that handle binary formats into front ends, for various reasons.

Many tools, including browsers, also support inspecting JSON data or can import or export JSON-encoded data.

*Pop quiz: does anyone remember what the “X” in AJAX stood for?

Alternatives

As I have tried to outline above, I think JSON has both strengths and weaknesses. Is there an alternative format that is superior? I list a few candidate formats below and try to assess them briefly.

One thing that they all have in common is that they are not as widely supported by programming languages and tools as JSON is at the moment. For most of the alternative formats, you would have to install some library in the environment of your choice. XML is already available in many environments by default, with the notable exception of JavaScript.

Even if a format is well supported by most programming languages, there are other tools that should handle the format, too.

If there aren’t any tools that allow converting existing data into the format, then this is a severe limitation. Browsers, for example, are important tools. Most of the alternative formats cannot be inspected easily with a browser, which makes debugging data transfers from browser-based applications hard.

Additionally, one should consider how many example datasets are available. I think at the moment it’s much more likely that you’ll find a JSON-encoded dump of Wikipedia somewhere on the Internet than one in any of the alternative formats.

Proprietary format

An alternative to using JSON would be to create our own binary format. We could use a protocol tailored to our needs, and make it very efficient. The disadvantage of using a proprietary format is that it is not supported anywhere, so writing clients for ArangoDB in another language becomes much harder for ourselves and for third-party contributors. Effectively, we would need to write an adapter for our binary protocol for each environment we want to have ArangoDB used in.

This sounds like it would take a lot of time and keep us from doing other things.

XML

It’s human-readable, understandable, has a good standard type system and is extensible. But if you thought that JSON is already inefficient and verbose, try using XML and have fun. A colleague of mine even claimed that XML is not human-readable due to its chattiness.

XML also hasn’t been adopted much in the JavaScript community, and we need to find a format that plays nicely with JavaScript.

Smile

There is also the Smile format. Its goal is to provide an efficient alternative to JSON. It looks good, but it does not seem to be used much outside of Jackson. As mentioned earlier, we need a format that is supported in a variety of environments.

BSON

Then there is BSON, made popular by MongoDB. We had a look at it. It is not as space-efficient as it could be, but it makes memory allocation very easy and allows for fast packing and unpacking. It is not so good when values inside the structure need to be updated. There are BSON libraries for several languages.

Still, it is a binary format. Using it for communication in ArangoDB’s case includes using it from arbitrary JavaScript programs (including applications run in a browser), using it in AJAX calls etc. This sounds a bit like debugging hell.
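The easy memory allocation mentioned for BSON comes from storing sizes up front. The following Python sketch shows the general idea of a length-prefixed string. It is written in the spirit of such formats and is not the exact BSON wire layout; the function names are mine.

```python
import struct

def pack_string(text: str) -> bytes:
    """Length-prefixed string: 4-byte little-endian byte count, then the UTF-8 bytes."""
    data = text.encode("utf-8")
    return struct.pack("<i", len(data)) + data

def unpack_string(buf: bytes, offset: int = 0):
    """Return (string, next_offset); the size is known before any payload byte is read."""
    (length,) = struct.unpack_from("<i", buf, offset)
    start = offset + 4
    # A reader could allocate exactly `length` bytes here, with no escape handling.
    return buf[start:start + length].decode("utf-8"), start + length

packed = pack_string("In Köln, it's 15 ° outside")
text, end = unpack_string(packed)
print(len(packed), text)  # 32 In Köln, it's 15 ° outside
```

Compared to JSON, the reader pays four extra bytes per string but never has to scan for a closing quote. The flip side is that the bytes on the wire are no longer something you can read in a terminal or a browser.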

Msgpack

Msgpack so far looks like the most promising alternative. It seems to be becoming available in more and more programming language environments. The format also seems to be relatively efficient.

A major drawback is that as a binary format, it will still be hard to debug. Tool support is also not that great yet. Using Msgpack with a browser also sounds like fun. I’d like it if tools like Firebug could display Msgpack packet internals.
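To give a rough idea of the size difference, here is a hand-rolled Python sketch that encodes a tiny subset of Msgpack (booleans, small non-negative integers, short strings, small lists and maps) and compares the result to compact JSON. It is not the official library and is only meant for illustration; the example document is made up.

```python
import json

def pack(value) -> bytes:
    """Encode a tiny subset of Msgpack: bool, ints 0..127, short str, small list/dict."""
    if value is True:
        return b"\xc3"
    if value is False:
        return b"\xc2"
    if isinstance(value, int) and 0 <= value <= 0x7F:
        return bytes([value])                        # positive fixint
    if isinstance(value, str):
        data = value.encode("utf-8")
        if len(data) <= 31:
            return bytes([0xA0 | len(data)]) + data  # fixstr: length lives in the type byte
    if isinstance(value, list) and len(value) <= 15:
        return bytes([0x90 | len(value)]) + b"".join(pack(v) for v in value)  # fixarray
    if isinstance(value, dict) and len(value) <= 15:
        body = b"".join(pack(k) + pack(v) for k, v in value.items())
        return bytes([0x80 | len(value)]) + body     # fixmap
    raise ValueError(f"outside the subset handled by this sketch: {value!r}")

doc = {"subject": "ArangoDB", "active": True, "tags": ["nosql", "database"]}
as_json = json.dumps(doc, separators=(",", ":")).encode("utf-8")
as_msgpack = pack(doc)
print(len(as_json), len(as_msgpack))  # 64 47
```

The savings come from one-byte booleans and from replacing quotes, colons and commas with length-carrying type bytes. Member names are still stored in full, so for this small document the gain is noticeable but not dramatic.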

Protocol buffers

Two years ago, we also experimented with Protocol buffers. Protocol buffers require you to set up a schema for the data first, and then provide efficient means to deserialize data from the wire into programming-language objects.

The problem is that there are no fixed schemas in a document database like ArangoDB. Users can structure their documents as they like. Each document can have a completely different structure.

We ended up defining a schema for something JSON-like inside Protocol buffers, and it did not make much sense in our use case.

Conclusion

There are alternative formats out there that address some of the issues that JSON has from my point of view. However, none of the other formats is yet as widely supported and as easy to use as JSON.

This may change over time.

For our use case, it looks like Msgpack could fit quite well, but probably only as a second, alternative interface for highest-efficiency data transfers.

J@ArangoDB
