To JSON or not?
We’re often asked why ArangoDB uses JSON as its data-interchange format for transferring documents from clients to the database server and back. This is often accompanied by the question of whether we could use <insert fancy format here> instead.
In this article, I’ll try to outline why we picked JSON as the interchange format, and why we still use it.
I’ll start with a discussion of the pros and cons of JSON, look at some alternative formats and present my personal opinion on why using a different format may not provide too much benefit for us, at least at the moment.
This post does not intend to say that any of these formats are better or worse in general. I think there are applications for all of them.
However, I wanted to look at the different formats with our specific use case, i.e. a RESTful database, in mind.
What I don’t like about JSON
JSON is often criticized for its inefficiency and lack of real data types. I often criticize it myself.
Following are my personal top 3 pain points.
Parsing and memory allocation
I have to admit that parsing JSON is painful from the efficiency perspective. When the JSON parser encounters a {, it knows this is the start of an object, but it has no idea how many object members will follow and need to be stored with the object. The same is true for lists (starting with [).
String values are no different: when the parser encounters a ", the length of the string is still unknown. To determine the length of the string, one must read until the end of the string, taking into account escape sequences for special characters, e.g. \\, but also Unicode escape sequences.
For example, the escaped 36-byte string
In K\u00f6ln, it's 15 \u00b0 outside
will be parsed into the 28-byte UTF-8 string
In Köln, it's 15 ° outside.
With the overall size of objects, lists or strings unknown at the start of a token, it’s hard to reserve the right amount of memory. Instead, memory either needs to be allocated on the fly as JSON tokens are parsed, or one or more (potentially too big) chunks of memory need to be set aside at the start of parsing. The parser can then use this already allocated memory to store whatever it finds afterwards.
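To make the size difference concrete, here is a minimal Python sketch. It is not how ArangoDB parses JSON; it only uses Python's standard json module to decode the example string from above and compare the byte counts before and after parsing.

```python
import json

# The escaped form as it appears between the quotes of a JSON document.
escaped = r"In K\u00f6ln, it's 15 \u00b0 outside"

# Let the standard-library parser decode the string token.
decoded = json.loads('"' + escaped + '"')

escaped_size = len(escaped.encode("ascii"))  # bytes on the wire, without the quotes
decoded_size = len(decoded.encode("utf-8"))  # bytes needed once parsing is done

print(escaped_size, decoded_size)  # prints: 36 28
```

The parser only learns the second number after it has walked over the whole token, which is exactly why it cannot reserve the right amount of memory up front.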
JSON data can also become very fluffy. I already mentioned that serializing strings to JSON might incur some overhead due to escape sequences.
But there are more things like this: each boolean value requires 4 (true) or 5 (false) bytes. Repeated object member names need to be stored again each time, as JSON does not provide string interning or similar mechanisms.
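As a rough illustration of this overhead, the following Python sketch measures how many bytes go into booleans and repeated member names. The sample records and the member names username and active are made up for this example; the actual share will depend on your data.

```python
import json

# Hypothetical sample data: 1000 records that all share the same two member names.
records = [{"username": "user%d" % i, "active": i % 2 == 0} for i in range(1000)]
encoded = json.dumps(records, separators=(",", ":"))  # most compact JSON form

true_size = len(json.dumps(True))    # length of the literal true
false_size = len(json.dumps(False))  # length of the literal false

# Bytes spent on member names alone, including their surrounding quotes.
name_bytes = sum(len(json.dumps(key)) for record in records for key in record)

print(true_size, false_size)  # prints: 4 5
print("member names: %d of %d bytes" % (name_bytes, len(encoded)))
```

Every record pays for its member names again, even though all 1000 records have the same structure.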
Apart from that, the JSON type system is limited. There is only one type to represent numbers. Different types for representing numbers of different value ranges are (intentionally) missing. For example, one might miss 64-bit integer data types or arbitrary-precision numbers. A date type (for calendar dates and times) is often missed, too.
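The single number type can bite in practice. The sketch below assumes a parser that stores every JSON number as a double-precision float (JavaScript's JSON.parse behaves this way); it emulates that in Python through the parse_int hook of json.loads. The member name id is only an example.

```python
import json

# 2**53 + 1 fits easily into a 64-bit integer, but not exactly into a double.
text = '{"id": 9007199254740993}'

exact = json.loads(text)["id"]                       # Python keeps arbitrary-size ints
as_double = json.loads(text, parse_int=float)["id"]  # emulate a double-only parser

print(exact, int(as_double))  # prints: 9007199254740993 9007199254740992
```

Both results come from the same, perfectly valid JSON text; which one you get depends on the parser, not on the format.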
And yes, binary data cannot be represented in JSON without converting it into a JSON string first. This may require base64-encoding or something similar.
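A minimal sketch of the base64 route in Python, using only the standard library. The member name data and the sample payload are arbitrary choices for this example.

```python
import base64
import json

payload = bytes(range(256)) * 3  # 768 bytes of arbitrary binary data

# Binary data has to be turned into text before it can live in a JSON string.
wrapped = json.dumps({"data": base64.b64encode(payload).decode("ascii")})

# The receiver must know that this particular string is base64 and undo it.
restored = base64.b64decode(json.loads(wrapped)["data"])

encoded_size = len(base64.b64encode(payload))
print(len(payload), encoded_size)  # prints: 768 1024
```

Base64 turns every 3 bytes into 4 characters, so the payload grows by a third, and nothing in the JSON itself tells a generic tool that the string is really binary data.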
In general, the available data types in JSON are very limited, and the format by itself is not extensible. Extending JSON with custom type information either creates ill-formed JSON (read: non-portable) or introduces special-meaning members that other programs and tools won’t understand (read: non-portable).
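Here is a small sketch of the "special-meaning members" variant. The $date wrapper is a hypothetical convention invented for this example, not something JSON or ArangoDB defines; the point is only that a decoder has to know the convention to benefit from it.

```python
import json
from datetime import datetime

def decode_tagged(obj):
    # Hypothetical convention: {"$date": "<ISO 8601 string>"} stands for a timestamp.
    if set(obj) == {"$date"}:
        return datetime.fromisoformat(obj["$date"])
    return obj

text = '{"created": {"$date": "2012-11-23T10:15:00"}}'

aware = json.loads(text, object_hook=decode_tagged)  # decoder that knows the convention
unaware = json.loads(text)                           # generic tool: just a nested object

print(type(aware["created"]).__name__, type(unaware["created"]).__name__)
# prints: datetime dict
```

The document stays valid JSON, but only programs that share the convention see a date; everything else sees an ordinary object with an oddly named member.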
Why still use JSON?
So what are the reasons to still stick with JSON? From my point of view, there are still a few good reasons to do so:
The JSON specification fits on five pages (including images). It is simple and intuitive.
Additionally, JSON-encoded data is instantly comprehensible. There is simply no need to look up the meanings of binary magic values in format specifications. It is also very easy to spot errors in ill-formed JSON.
In my eyes, looking at JSON data during a debugging session is much easier than looking at binary data (and I do look at binary data sometimes).
JSON requires no schema to be defined for data. This is good, as it lets you get something done sooner. Schemas also tend to change over time, and this can become a problem with other formats that have schemas. With schema-less JSON, a schema change becomes a no-brainer: just change the data inside the JSON and you’re done. No need to maintain a separate schema.
The schema-relaxed approach of JSON also plays quite well with languages that are loosely typed or allow runtime modifications of data structures. Most scripting languages are in this category.
For any other language without built-in support for JSON, it won’t be hard to find a robust implementation for JSON serialization/deserialization.
Many tools, including browsers, also support inspecting JSON data or can import or export JSON-encoded data.
* Pop quiz: does anyone remember what the “X” in AJAX stood for?
As I have tried to outline above, I think JSON has both strengths and weaknesses. Is there an alternative format that is superior? I list a few candidate formats below and try to assess them briefly.
Even if a format is well supported by most programming languages, there are other tools that should handle the format, too.
If there aren’t any tools that allow converting existing data into the format, then this is a severe limitation. Browsers, for example, are important tools. Most of the alternative formats cannot be inspected easily with a browser, which makes debugging data transfers from browser-based applications hard.
Additionally, one should consider how many example datasets are available. I think at the moment it’s much more likely that you’ll find a JSON-encoded dump of Wikipedia somewhere on the Internet than one in any of the alternative formats.
An alternative to using JSON would be to create our own binary format. We could use a protocol tailored to our needs, and make it very efficient. The disadvantage of using a proprietary format is that it is not supported anywhere, so writing clients for ArangoDB in another language becomes much harder for ourselves and for third-party contributors. Effectively, we would need to write an adapter for our binary protocol for each environment we want ArangoDB to be used in.
This sounds like it would take a lot of time and keep us from doing other things.
XML is human-readable, understandable, has a good standard type system and is extensible. But if you thought that JSON is already inefficient and verbose, try using XML and have fun. A colleague of mine even claimed that XML is not human-readable due to its chattiness.
There is also the Smile format. Its goals are to provide an efficient alternative to JSON. It looks good, but it does not seem to be used much outside of Jackson. As mentioned earlier, we need a format that is supported in a variety of environments.
Then there is BSON, made popular by MongoDB. We had a look at it. It is not as space-efficient as it could be, but it makes memory allocation very easy and allows for fast packing and unpacking. It is not so good when values inside the structure need to be updated. There are BSON libraries for several languages.
Msgpack so far looks like the most promising alternative. It seems to be becoming available in more and more programming language environments. The format also seems to be relatively efficient.
A major drawback is that, as a binary format, it will still be hard to debug. Tool support is also not that great yet. Using Msgpack with a browser also sounds like fun. I’d like it if tools like Firebug could display Msgpack packet internals.
Two years ago, we also experimented with Protocol Buffers. Protocol Buffers require you to set up a schema for the data first, and then provide efficient means to deserialize data from the wire into programming-language objects.
The problem is that there are no fixed schemas in a document database like ArangoDB. Users can structure their documents as they like. Each document can have a completely different structure.
We ended up defining a schema for something JSON-like inside Protocol buffers, and it did not make much sense in our use case.
There are alternative formats out there that address some of the issues that JSON has from my point of view. However, none of the other formats is yet as widely supported and as easy to use as JSON.
This may change over time.
For our use case, it looks like Msgpack could fit quite well, but probably only as a second, alternative interface for highest-efficiency data transfers.