I just came across a question about how to create highscore lists or leaderboards in ArangoDB, and how they would work when compared to Redis sorted sets.
This blog post tries to give an answer on the topic and also detailed instructions and queries for setting up highscore lists with ArangoDB.
A highscore list in Redis
Highscore lists are normally used to quickly determine who’s currently at the top, so we obviously need some sorted data structure.
Redis has a specialized datatype named sorted set which can be used for exactly this purpose. A sorted set in Redis is a value consisting of multiple key/value pairs, and that is automatically sorted by values. The sorted set is stored under a key so it can be accessed as a whole.
Here’s how one would create a sorted set named highscores
and populate
it with 5 key/value pairs in Redis (using redis-cli):
1
|
|
Adding a new entry to a sorted set is done using ZADD
too.
Inserting into a Redis sorted set has logarithmic complexity.
Updating a score in the sorted set is done using ZINCRBY
. This command works
regardless of whether the to-be-updated key already exists in the sorted set.
If it exists, its score will be increased by the specified value, and if it does
not exist, it will be created with the specified value:
1 2 |
|
In this case the return value 1
indicates that a new key was added to the set
and that it didn’t update an existing one.
Querying the entries with the lowest scores from a Redis sorted set is trivial.
The ZRANGE
command will query the entries in the sorted set from lowest to
highest score. As the entries are already stored in sorted order, this is very
efficient.
The following command queries the bottom 3 keys from the sorted set:
1 2 3 4 |
|
For querying in reverse order, there is ZREVRANGE
. Both commands can be
accompanied by the WITHSCORES
flag to also return the associated values (i.e.
the scores). Here are the top 3 key/value pairs in the sorted set:
1 2 3 4 5 6 7 |
|
For removing an entry from a sorted set there is ZREM
:
1 2 |
|
There are many more specialized Redis commands for working with sorted sets. The
Redis commands prefixed with a Z
are sorted set
commands.
A highscore list in ArangoDB
Now let’s try to mimic that with ArangoDB.
In ArangoDB, there is no such thing as a sorted set and no direct equivalent. Instead, data in ArangoDB are stored in collections. Collections are a general-purpose storage mechanism and they are not limited to storing just scores.
We also need a mechanism for keeping highscores sorted. By default, no
specific sort order is maintained for data in a collection. To have the
collection entries sorted by highscore values, we have to explicitly create
a (sorted) skiplist index on some attribute. We’ll use an attribute named
score
for this.
The following shell commands create the collection and the index on score
:
1 2 |
|
Once the collection is set up, we can switch to AQL for the following operations (though we could achieve the same with Shell commands).
To insert the same initial data as in the Redis case, we can run the following five AQL queries:
1 2 3 4 5 |
|
Note that I have been using the _key
attribute for saving the user id. Using the
_key
attribute is normally beneficial because it is the collection’s primary key.
It is always present and automatically unique, so exactly what we need for maintaining
a highscore list. Note that there are some restrictions for what can be stored inside
the _key
attribute, but as long as values are only ASCII letters or digits, there
is nothing to worry about.
Inserting into the collection will also automatically populate the indexes. Inserting into a skiplist should have about logarithmic complexity on average (though this is not guaranteed – this is because the skiplist is a probabilistic data structure and internally it will be flipping coins. In theory there is a chance that it becomes badly balanced. But in practice it should be quite close to an average logarithmic complexity).
As we have some initial documents, we can now query the lowest and highest scores.
This will also be efficient as the queries will use the sorted index on score
:
1 2 3 4 |
|
1 2 3 4 |
|
To store a highscore for a user without knowing in advance whether a value has already
been stored before for this user, one can use UPSERT
. The UPSERT
will either insert
a new highscore entry, or update an existing one if already present:
1 2 3 4 |
|
If there is already an entry with a key max
, its scores will be increased by 80.
If such entry does not exist, it will be created. In both cases, the new score will
be returned.
Note: the UPSERT
command has been added in ArangoDB version 2.6.
Finally, removing an entry from a highscore list is a straight-forward remove operation:
1
|
|
Extensions
We’ll now build on this simple example and create slightly more advanced highscore list use cases. The following topics will be covered:
- multi-game highscore lists
- joining data
- maintaining a “last updated” date
Multi-game highscore lists
We’ll start with generalizing the single-game highscore list into a multi-game highscore list.
In Redis, one would create multiple sorted sets for handling the highscore lists of multiple games. Multiple Redis sorted sets are stored under different keys, so they are isolated from each other.
Though Redis provides a few commands to aggregate data from multiple sorted sets
(ZUNIONSTORE
and ZINTERSTORE
) into a new sorted set, other cross-set operations are
not supported. This is not a problem if the client application does not have to
perform cross-set queries or cleanup tasks.
In ArangoDB, multi-game highscore lists can be implemented in two variants.
In order to decide which variant is better suited, we need to be clear about whether
all highscores should be stored in the same collection or if we prefer using multiple
collections (e.g. one per game).
Storing highscores for different games in separate collections has the advantage that they’re really isolated. It is easy to get rid of a specific highscore list by simply dropping its collection. It is also easy to get right query-wise.
All that needs to be changed to turn the above examples into a multi-game highscore
list solution is to change the hard-coded collection name highscores
and make it a
bind parameter, so the right collection name can be injected by the client application
easily.
On the downside, the multi-collection solution will make cross-game operations difficult.
Additionally, having one collection per game may get out of hand when there are many,
many highscore lists to maintain. In case there are many but small highscore lists to
maintain, it might be better to put them into a single collection and add a game
attribute to tell the individual lists apart in it.
Let’s focus on this and put all highscores of all games into a single collection.
The first adjustment that needs to be made is that we cannot use _key
for user ids
anymore. This is because user ids may repeat now (a user may be contained in more than
one list). So we will change the design and make the combination of game
and user
a new unique key:
1 2 3 4 |
|
We can use the unique hash index on user
and game
to ensure there is at most one entry
for per user per game. It also allows use to find out quickly whether we already have
an entry for that particular combination of game and user. Because we are not using
_key
we could now also switch to numeric ids if we preferred that.
The other index on game
and score
is sorted. It can be used to quickly retrieve the
leaderboard for a given game. As it is primarily sorted by game
, it can also be used
to enumerate all entries for a given game.
The following Shell command populates the multi-game highscores collection with 55,000 highscores:
1 2 3 4 5 6 7 8 9 |
|
The game ids used above are between 0 and 9, though any other game ids would work, too. User ids are stringified numbers.
We can now find out the leaderboard for game 2 with the following adjusted AQL query. The query will use the (sorted) skiplist index:
1 2 3 4 5 |
|
Removing all scores for a specific game is also efficient due to the the same index:
1 2 3 |
|
On a side note: when storing all highscores in the same collection, we could also
run cross-game queries if we wanted to. All that needs to be done for this is adjusting
the FILTER
conditions in the queries.
Inserting or updating a user score can be achieved using an UPSERT
.
Here’s a query to increase the score of user "1571"
in game 2
by a value of 5:
1 2 3 4 |
|
The same index on [ "user", "game" ]
is used in the following query that will
delete the highscore of a given user in a specific game:
1 2 3 4 |
|
Joining data
Querying the leaderboard for a specific game was easy. However, so far we have only queried user ids and associated scores in games. In reality, we probably want to display some more user information in a leaderboard, for example their screen names.
In Redis, no extra information can be stored in sorted sets. So extra user information must be stored under separate keys. There is no concept of joins in Redis. The scores contained in the sorted set need to be queried by the client application, and extra user information have to be queried by the client application separately.
In ArangoDB, we could store the screen names in the highscores collection along with the highscores so we can easily query them with the leaderboard query. This is also how it would be done in MongoDB due to the absence of joins there.
While this would work, it will create lots of redundant data if the screen names are also used and stored elsewhere.
So let’s pick the option that stores highscores and screen names in separate places, and brings them together only when needed in a leaderboard query.
Let’s store screen names in a collection named users
. The following Shell commands
will create the collection and set up 100K users with dummy screen names:
1 2 3 4 5 6 7 |
|
We can now query the highscores plus the screen name in one go:
1 2 3 4 5 6 7 |
|
Maintaining a “last updated” date
Finally, let’s try to keep track of when a highscore was last updated. There are a few use cases for this, for example displaying the date and time of when a highscore was achieved or for revmoing older highscores.
In Redis, the sorted set values are just the numeric scores, so we cannot store anything else (such as a date) inside the sorted sets. We would really need to store the update date for each highscore entry outside the sorted set, either under a separate key, or using a Redis hash. However, this is complex to manage and keep consistent so we won’t do it.
For implementing the automatic expiration, it would be good if we could use the built-in automatic key expiration of Redis. Each key can optionally be given a time-to-live or an expiration date, and it will automatically expire and vanish then without further ado. This may be exactly what we need to remove older highscore entries, but we cannot use it. The reason is that expiration only works for keys at the top level, but not for individual keys inside a sorted set. So we cannot really implement this sanely.
Let’s switch to ArangoDB now. Here we work with arbitrarily structured documents.
That means we can store any other attributes along with a highscore. We can store the
timestamp of when a highscore was last set or updated in an attribute named date
:
1 2 3 4 5 |
|
The date
attribute can now be used for display purposes already.
We can also use the date
attribute for identifying the oldest entries in the
highscore list in case we want the list to be periodically cleaned up.
Obviously we will be indexing date
for this, but we need to decide whether we want to use
the same expiration periods for all games, or if we want to use game-specific expirations.
If the expiration date is the same for all games, then we can index just date
:
1
|
|
If we now want to remove entries older than roughly 2 days, regardless of the associated game, the removal query looks like this:
1 2 3 4 5 |
|
If we instead want to find (and remove) the oldest entries for individual games,
we need to create the index on game
and date
:
1
|
|
This index allows to efficiently get rid of the oldest entries per game:
1 2 3 4 5 6 |
|
On a side note: the REMOVE
was limited to the oldest 1000 entries. This
was done to make the query return fast. The removal query can be repeated while
there are still entries to remove.