J@ArangoDB

{ "subject" : "ArangoDB", "tags": [ "multi-model", "nosql", "database" ] }

Creating Highscore Lists

I just came across a question about how to create highscore lists or leaderboards in ArangoDB, and how they would work when compared to Redis sorted sets.

This blog post tries to give an answer on the topic and also detailed instructions and queries for setting up highscore lists with ArangoDB.

A highscore list in Redis

Highscore lists are normally used to quickly determine who’s currently at the top, so we obviously need some sorted data structure.

Redis has a specialized datatype named sorted set which can be used for exactly this purpose. A sorted set in Redis is a value consisting of multiple key/value pairs, and that is automatically sorted by values. The sorted set is stored under a key so it can be accessed as a whole.

Here’s how one would create a sorted set named highscores and populate it with 5 key/value pairs in Redis (using redis-cli):

creating a sorted set in Redis
1
> ZADD highscores frank 50 jan 20 willi 35 thomas 75 ingo 60

Adding a new entry to a sorted set is done using ZADD too. Inserting into a Redis sorted set has logarithmic complexity.

Updating a score in the sorted set is done using ZINCRBY. This command works regardless of whether the to-be-updated key already exists in the sorted set. If it exists, its score will be increased by the specified value, and if it does not exist, it will be created with the specified value:

updating a value in a sorted set
1
2
> ZINCRBY highscores 80 max
(integer) 1

In this case the return value 1 indicates that a new key was added to the set and that it didn’t update an existing one.

Querying the entries with the lowest scores from a Redis sorted set is trivial.

The ZRANGE command will query the entries in the sorted set from lowest to highest score. As the entries are already stored in sorted order, this is very efficient.

The following command queries the bottom 3 keys from the sorted set:

querying the sorted set, from lowest to highest
1
2
3
4
ZRANGE highscores 0 2
1) "jan"
2) "willi"
3) "frank"

For querying in reverse order, there is ZREVRANGE. Both commands can be accompanied by the WITHSCORES flag to also return the associated values (i.e. the scores). Here are the top 3 key/value pairs in the sorted set:

querying the sorted set, from highest to lowest and with scores
1
2
3
4
5
6
7
> ZREVRANGE highscores 0 2 WITHSCORES
1) "max"
2) "80"
3) "thomas"
4) "70"
5) "ingo"
6) "60"

For removing an entry from a sorted set there is ZREM:

removing a key from a sorted set
1
2
> ZREM highscores jan
(integer) 1

There are many more specialized Redis commands for working with sorted sets. The Redis commands prefixed with a Z are sorted set commands.

A highscore list in ArangoDB

Now let’s try to mimic that with ArangoDB.

In ArangoDB, there is no such thing as a sorted set and no direct equivalent. Instead, data in ArangoDB are stored in collections. Collections are a general-purpose storage mechanism and they are not limited to storing just scores.

We also need a mechanism for keeping highscores sorted. By default, no specific sort order is maintained for data in a collection. To have the collection entries sorted by highscore values, we have to explicitly create a (sorted) skiplist index on some attribute. We’ll use an attribute named score for this.

The following shell commands create the collection and the index on score:

creating the highscores collection in ArangoDB
1
2
db._create("highscores");
db.highscores.ensureIndex({ type: "skiplist", fields: [ "score" ] });

Once the collection is set up, we can switch to AQL for the following operations (though we could achieve the same with Shell commands).

To insert the same initial data as in the Redis case, we can run the following five AQL queries:

inserting initial scores
1
2
3
4
5
INSERT { _key: "frank", score: 50 } IN highscores
INSERT { _key: "jan", score: 20 } IN highscores
INSERT { _key: "willi", score: 35 } IN highscores
INSERT { _key: "thomas", score: 75 } IN highscores
INSERT { _key: "ingo", score: 60 } IN highscores

Note that I have been using the _key attribute for saving the user id. Using the _key attribute is normally beneficial because it is the collection’s primary key. It is always present and automatically unique, so exactly what we need for maintaining a highscore list. Note that there are some restrictions for what can be stored inside the _key attribute, but as long as values are only ASCII letters or digits, there is nothing to worry about.

Inserting into the collection will also automatically populate the indexes. Inserting into a skiplist should have about logarithmic complexity on average (though this is not guaranteed – this is because the skiplist is a probabilistic data structure and internally it will be flipping coins. In theory there is a chance that it becomes badly balanced. But in practice it should be quite close to an average logarithmic complexity).

As we have some initial documents, we can now query the lowest and highest scores. This will also be efficient as the queries will use the sorted index on score:

querying the users with lowest scores
1
2
3
4
FOR h IN highscores 
  SORT h.score ASC 
  LIMIT 3 
  RETURN { user: h._key, score: h.score }
querying the users with highest scores
1
2
3
4
FOR h IN highscores 
  SORT h.score DESC 
  LIMIT 3 
  RETURN { user: h._key, score: h.score }

To store a highscore for a user without knowing in advance whether a value has already been stored before for this user, one can use UPSERT. The UPSERT will either insert a new highscore entry, or update an existing one if already present:

using UPSERT
1
2
3
4
UPSERT { _key: "max" } 
  INSERT { _key: "max", score: 80 } 
  UPDATE { score: OLD.score + 80 } IN highscores 
  RETURN { user: NEW._key, score: NEW.score }

If there is already an entry with a key max, its scores will be increased by 80. If such entry does not exist, it will be created. In both cases, the new score will be returned.

Note: the UPSERT command has been added in ArangoDB version 2.6.

Finally, removing an entry from a highscore list is a straight-forward remove operation:

removing an element
1
REMOVE { _key: "jan" } IN highscores

Extensions

We’ll now build on this simple example and create slightly more advanced highscore list use cases. The following topics will be covered:

  • multi-game highscore lists
  • joining data
  • maintaining a “last updated” date

Multi-game highscore lists

We’ll start with generalizing the single-game highscore list into a multi-game highscore list.

In Redis, one would create multiple sorted sets for handling the highscore lists of multiple games. Multiple Redis sorted sets are stored under different keys, so they are isolated from each other.

Though Redis provides a few commands to aggregate data from multiple sorted sets (ZUNIONSTORE and ZINTERSTORE) into a new sorted set, other cross-set operations are not supported. This is not a problem if the client application does not have to perform cross-set queries or cleanup tasks.

In ArangoDB, multi-game highscore lists can be implemented in two variants.
In order to decide which variant is better suited, we need to be clear about whether all highscores should be stored in the same collection or if we prefer using multiple collections (e.g. one per game).

Storing highscores for different games in separate collections has the advantage that they’re really isolated. It is easy to get rid of a specific highscore list by simply dropping its collection. It is also easy to get right query-wise.

All that needs to be changed to turn the above examples into a multi-game highscore list solution is to change the hard-coded collection name highscores and make it a bind parameter, so the right collection name can be injected by the client application easily.

On the downside, the multi-collection solution will make cross-game operations difficult. Additionally, having one collection per game may get out of hand when there are many, many highscore lists to maintain. In case there are many but small highscore lists to maintain, it might be better to put them into a single collection and add a game attribute to tell the individual lists apart in it.

Let’s focus on this and put all highscores of all games into a single collection.

The first adjustment that needs to be made is that we cannot use _key for user ids anymore. This is because user ids may repeat now (a user may be contained in more than one list). So we will change the design and make the combination of game and user a new unique key:

creating a multi-game highscore collection
1
2
3
4
db._drop("highscores");
db._create("highscores");
db.highscores.ensureIndex({ type: "hash", unique: true, fields: [ "user", "game" ] });
db.highscores.ensureIndex({ type: "skiplist", fields: [ "game", "score" ] });

We can use the unique hash index on user and game to ensure there is at most one entry for per user per game. It also allows use to find out quickly whether we already have an entry for that particular combination of game and user. Because we are not using _key we could now also switch to numeric ids if we preferred that.

The other index on game and score is sorted. It can be used to quickly retrieve the leaderboard for a given game. As it is primarily sorted by game, it can also be used to enumerate all entries for a given game.

The following Shell command populates the multi-game highscores collection with 55,000 highscores:

populating the multi-game collection
1
2
3
4
5
6
7
8
9
for (var game = 0; game < 10; ++game) {
  for (var user = 0; user < (game + 1) * 1000; ++user) {
    db.highscores.save({
      game: game,
      user: String(user),
      score: (game + user) % 997  /* arbitrary score */
    });
  }
}

The game ids used above are between 0 and 9, though any other game ids would work, too. User ids are stringified numbers.

We can now find out the leaderboard for game 2 with the following adjusted AQL query. The query will use the (sorted) skiplist index:

querying the leaderboard of a specific game
1
2
3
4
5
FOR h IN highscores 
  FILTER h.game == 2 
  SORT h.score DESC 
  LIMIT 3 
  RETURN { user: h.user, score: h.score }

Removing all scores for a specific game is also efficient due to the the same index:

removing all scores for game 5
1
2
3
FOR h IN highscores
  FILTER h.game == 5
  REMOVE h IN highscores

On a side note: when storing all highscores in the same collection, we could also run cross-game queries if we wanted to. All that needs to be done for this is adjusting the FILTER conditions in the queries.

Inserting or updating a user score can be achieved using an UPSERT. Here’s a query to increase the score of user "1571" in game 2 by a value of 5:

updating a score for a specific user/game combination
1
2
3
4
UPSERT { game: 2, user: "1571" } 
  INSERT { game: 2, user: "1571", score: 5 } 
  UPDATE { score: OLD.score + 5 } IN highscores 
  RETURN { user: NEW._key, score: NEW.score }

The same index on [ "user", "game" ] is used in the following query that will delete the highscore of a given user in a specific game:

removing a score for a specific user/game combination
1
2
3
4
FOR h IN highscores 
  FILTER h.game == 6 
  FILTER h.user == '3894' 
  REMOVE h IN highscores

Joining data

Querying the leaderboard for a specific game was easy. However, so far we have only queried user ids and associated scores in games. In reality, we probably want to display some more user information in a leaderboard, for example their screen names.

In Redis, no extra information can be stored in sorted sets. So extra user information must be stored under separate keys. There is no concept of joins in Redis. The scores contained in the sorted set need to be queried by the client application, and extra user information have to be queried by the client application separately.

In ArangoDB, we could store the screen names in the highscores collection along with the highscores so we can easily query them with the leaderboard query. This is also how it would be done in MongoDB due to the absence of joins there.

While this would work, it will create lots of redundant data if the screen names are also used and stored elsewhere.

So let’s pick the option that stores highscores and screen names in separate places, and brings them together only when needed in a leaderboard query.

Let’s store screen names in a collection named users. The following Shell commands will create the collection and set up 100K users with dummy screen names:

creating test users
1
2
3
4
5
6
7
db._create("users");
for (var i = 0; i < 100000; ++i) {
  db.users.insert({
    _key: String(i),
    name: "test user #" + i
  });
}

We can now query the highscores plus the screen name in one go:

joining highscores with user data
1
2
3
4
5
6
7
FOR h IN highscores 
  FILTER h.game == 2 
  SORT h.score DESC 
  LIMIT 3 
  FOR u IN users 
    FILTER h.user == u._key 
    RETURN { user: h.user, name: u.name, score: h.score } 

Maintaining a “last updated” date

Finally, let’s try to keep track of when a highscore was last updated. There are a few use cases for this, for example displaying the date and time of when a highscore was achieved or for revmoing older highscores.

In Redis, the sorted set values are just the numeric scores, so we cannot store anything else (such as a date) inside the sorted sets. We would really need to store the update date for each highscore entry outside the sorted set, either under a separate key, or using a Redis hash. However, this is complex to manage and keep consistent so we won’t do it.

For implementing the automatic expiration, it would be good if we could use the built-in automatic key expiration of Redis. Each key can optionally be given a time-to-live or an expiration date, and it will automatically expire and vanish then without further ado. This may be exactly what we need to remove older highscore entries, but we cannot use it. The reason is that expiration only works for keys at the top level, but not for individual keys inside a sorted set. So we cannot really implement this sanely.

Let’s switch to ArangoDB now. Here we work with arbitrarily structured documents. That means we can store any other attributes along with a highscore. We can store the timestamp of when a highscore was last set or updated in an attribute named date:

storing the date of last update
1
2
3
4
5
LET now = DATE_NOW()
UPSERT { game: 2, user: "1571" } 
  INSERT { game: 2, user: "1571", score: 10, date: now } 
  UPDATE { score: OLD.score + 10, date: now } IN highscores 
  RETURN { user: NEW._key, score: NEW.score }

The date attribute can now be used for display purposes already.

We can also use the date attribute for identifying the oldest entries in the highscore list in case we want the list to be periodically cleaned up.

Obviously we will be indexing date for this, but we need to decide whether we want to use the same expiration periods for all games, or if we want to use game-specific expirations.

If the expiration date is the same for all games, then we can index just date:

creating the index on date
1
db.highscores.ensureIndex({ type: "skiplist", fields: [ "date" ] });

If we now want to remove entries older than roughly 2 days, regardless of the associated game, the removal query looks like this:

deleting oldest entries
1
2
3
4
5
LET compare = DATE_NOW() - 2 * 86400 * 1000
FOR h IN highscores
  FILTER h.date < compare
  LIMIT 1000
  REMOVE h IN highscores

If we instead want to find (and remove) the oldest entries for individual games, we need to create the index on game and date:

creating the index on game and date
1
db.highscores.ensureIndex({ type: "skiplist", fields: [ "game", "date" ] });

This index allows to efficiently get rid of the oldest entries per game:

remvoin oldest entries for a game
1
2
3
4
5
6
LET compare = DATE_NOW() - 2 * 86400 * 1000
FOR h IN highscores
  FILTER h.game == 2
  FILTER h.date < compare
  LIMIT 1000
  REMOVE h IN highscores

On a side note: the REMOVE was limited to the oldest 1000 entries. This was done to make the query return fast. The removal query can be repeated while there are still entries to remove.