
Improved Non-unique Hash Indexes in 2.3

With ArangoDB 2.3 now getting into the beta stage, it’s time to spread the word about new features and improvements.

Today’s post will be about the changes made to non-unique hash indexes.

Hash indexes allow looking up documents quickly if the indexed attributes are all provided in a search query. They are not suitable for range queries, but are the perfect choice if equality comparisons are all that’s needed.
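Conceptually, a non-unique hash index maps each indexed value to the set of documents carrying that value. The following Python sketch is a toy model (not ArangoDB's actual C implementation) that illustrates why equality lookups are fast while range queries are not:

```python
from collections import defaultdict

class NonUniqueHashIndex:
    """Toy model of a non-unique hash index: maps an indexed
    attribute value to the list of matching document keys."""

    def __init__(self, attribute):
        self.attribute = attribute
        self.buckets = defaultdict(list)

    def insert(self, key, document):
        # Hash on the full attribute value; duplicates share a bucket.
        self.buckets[document[self.attribute]].append(key)

    def lookup(self, value):
        # Equality lookup is O(1) on average: hash the value, fetch the bucket.
        # A range query would have to scan every bucket, so hash indexes
        # cannot support it efficiently.
        return self.buckets.get(value, [])

idx = NonUniqueHashIndex("status")
idx.insert("doc1", {"status": "active"})
idx.insert("doc2", {"status": "active"})
idx.insert("doc3", {"status": "inactive"})
print(idx.lookup("active"))   # ['doc1', 'doc2']
```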

Hash indexes have been available in ArangoDB from the very beginning. There have always been two variants:

  • unique hash indexes
  • non-unique hash indexes

There wasn’t much to be done for unique hash indexes, and so there haven’t been any changes to them in 2.3. However, the non-unique hash indexes were improved significantly in the new version.

The non-unique indexes already performed quite well if most of the indexed values were unique and only a few repetitions occurred. But their performance suffered severely if the indexed attribute values repeated a lot – that is, when the indexed value had a low cardinality and thus the index had a low selectivity.
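The relationship between the two terms can be pinned down with a one-liner (an illustrative definition, not an ArangoDB API: selectivity here is simply the ratio of distinct values to total values):

```python
def selectivity(values):
    """Index selectivity: distinct values / total values, in (0, 1].

    1.0 means every value is unique; values near 0 mean heavy repetition
    (low cardinality), the case that used to hurt non-unique hash indexes."""
    return len(set(values)) / len(values)

# 2,048,000 documents whose indexed attribute cycles through 10 values:
docs = [i % 10 for i in range(2_048_000)]    # cardinality 10
print(selectivity(docs))                     # ~0.0000049, i.e. very low
```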

This was a problem because it slowed down inserting new documents into a collection with such an index. And it also slowed down loading collections with low cardinality hash indexes.

I am happy to state that in ArangoDB 2.3 this has been fixed, and the insert performance of non-unique hash indexes has been improved significantly. The index insertion time now scales quite well with the number of indexed documents regardless of the cardinality of the indexed attribute.

Following are a few measurements of non-unique hash index insertion times from ArangoDB 2.3, for different cardinalities of the indexed attribute.

The times reported are the net non-unique hash index insertion times (the documents were present already, just the index was created on them and index creation time was measured).
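As a rough sketch of that methodology (plain Python standing in for the real benchmark; the function name and document shape are made up for illustration), the documents are generated first and only the index build is timed. Note how `i % cardinality` controls how often indexed values repeat:

```python
import time

def benchmark_index_build(num_docs, cardinality):
    """Generate num_docs documents whose indexed attribute takes
    `cardinality` distinct values, then time only the index build."""
    docs = [{"_key": f"doc{i}", "value": i % cardinality}
            for i in range(num_docs)]

    index = {}
    start = time.perf_counter()
    for d in docs:                    # only index construction is timed,
        index.setdefault(d["value"], []).append(d["_key"])
    elapsed = time.perf_counter() - start

    return elapsed, len(index)        # build time and resulting cardinality

elapsed, distinct = benchmark_index_build(128_000, 100)
print(f"{distinct} distinct values indexed in {elapsed:.3f} s")
```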

Let’s start with a not too challenging case: indexing documents in a collection with 100,000 different index values (cardinality 100,000):

index insertion times for cardinality 100,000
number of documents:    128,000    =>    time:   0.144 s
number of documents:    256,000    =>    time:   0.231 s
number of documents:    512,000    =>    time:   0.347 s
number of documents:  1,024,000    =>    time:   0.694 s
number of documents:  2,048,000    =>    time:   1.379 s

The picture doesn’t change much when reducing the cardinality by a factor of 10 (i.e. cardinality 10,000):

index insertion times for cardinality 10,000
number of documents:    128,000    =>    time:   0.169 s
number of documents:    256,000    =>    time:   0.194 s
number of documents:    512,000    =>    time:   0.355 s
number of documents:  1,024,000    =>    time:   0.668 s
number of documents:  2,048,000    =>    time:   1.325 s

Let’s again divide cardinality by 10 (now cardinality 1,000):

index insertion times for cardinality 1,000
number of documents:    128,000    =>    time:   0.130 s
number of documents:    256,000    =>    time:   0.152 s
number of documents:    512,000    =>    time:   0.261 s
number of documents:  1,024,000    =>    time:   0.524 s
number of documents:  2,048,000    =>    time:   0.934 s

Cardinality 100:

index insertion times for cardinality 100
number of documents:    128,000    =>    time:   0.114 s
number of documents:    256,000    =>    time:   0.148 s
number of documents:    512,000    =>    time:   0.337 s
number of documents:  1,024,000    =>    time:   0.452 s
number of documents:  2,048,000    =>    time:   0.907 s

Cardinality 10:

index insertion times for cardinality 10
number of documents:    128,000    =>    time:   0.130 s
number of documents:    256,000    =>    time:   0.327 s
number of documents:    512,000    =>    time:   0.239 s
number of documents:  1,024,000    =>    time:   0.442 s
number of documents:  2,048,000    =>    time:   0.827 s

Finally we get to cardinality 1: every document shares the same indexed value, making the index effectively useless. Let’s create it anyway, for the sake of completeness:

index insertion times for cardinality 1
number of documents:    128,000    =>    time:   0.095 s
number of documents:    256,000    =>    time:   0.146 s
number of documents:    512,000    =>    time:   0.246 s
number of documents:  1,024,000    =>    time:   0.445 s
number of documents:  2,048,000    =>    time:   0.925 s

On a side note: all indexed values were numeric. In absolute terms, indexing string values will be slower than indexing numbers, but insertion should still scale nicely with the number of documents as long as everything fits in RAM.