A few weeks ago I wrote about ArangoDB’s specialized export API.
The export API is useful when the goal is to extract all documents from a given collection and to process them outside of ArangoDB.
The export API provides quick and memory-efficient snapshots of the data in the underlying collection, making it well suited for extracting all of a collection's documents. It can also deliver the data much faster than an AQL query that extracts all documents.
In this post I’ll show how to use the export API to extract data and process it with PHP.
A prerequisite for using the export API is an ArangoDB server with version 2.6
or higher. As there hasn't been an official 2.6 release yet, this currently requires
building the devel branch of ArangoDB from source. Once there is a regular 2.6
release, it should be used instead.
Importing example data
First we need some data in an ArangoDB collection that we can process externally.
For the following examples, I'll use a collection named users, which I'll populate
with 100k example documents. Here's how to get this data into ArangoDB:
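The original snippet isn't reproduced here. A rough equivalent, run from an arangosh prompt connected to the server, might look like the following sketch; the attribute values are made up, but the attribute names match the fields used later in this post:

```js
// run inside arangosh (assumption: default _system database);
// generates 100k synthetic user documents
db._create("users");

for (var i = 0; i < 100000; ++i) {
  db.users.save({
    name: { first: "first" + i, last: "last" + i },
    gender: (i % 2 === 0 ? "female" : "male"),
    birthday: "1984-10-12",
    memberSince: "2015-01-17",
    contact: { email: [ "user" + i + "@example.com" ] },
    likes: [ "running", "reading" ]
  });
}

db.users.count();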
There should now be 100K documents present in a collection named users. You can
quickly verify that by peeking into the collection using the web interface.
Setting up ArangoDB-PHP
An easy way of trying the export API is to use it from PHP. We therefore clone the devel branch of the arangodb-php GitHub repository into a local directory:
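The clone command itself is a one-liner (the exact repository URL is an assumption based on the project name):

```shell
# clone the devel branch of the PHP driver
git clone -b devel https://github.com/arangodb/arangodb-php.git
```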
Note: once there is an official 2.6 release, the 2.6 branch of arangodb-php should
be used instead of the devel branch.
We now write a simple PHP script that establishes a connection to the ArangoDB server running on localhost. We’ll extend that file gradually. Here’s a skeleton file to start with. The code can be downloaded here:
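The skeleton isn't reproduced here; a minimal sketch, assuming arangodb-php's bundled autoloader and its documented connection options, could look like this:

```php
<?php
// connection skeleton (assumption: autoloader path and option names
// follow the arangodb-php library's conventions)
require __DIR__ . '/arangodb-php/autoload.php';

use triagens\ArangoDb\Connection;
use triagens\ArangoDb\ConnectionOptions;
use triagens\ArangoDb\Exception as ArangoException;

try {
    $connection = new Connection(array(
        ConnectionOptions::OPTION_ENDPOINT => 'tcp://localhost:8529',
        ConnectionOptions::OPTION_DATABASE => '_system',
    ));

    // TODO: export and process data here

    print 'Connected!' . PHP_EOL;
} catch (ArangoException $e) {
    print 'Error: ' . $e->getMessage() . PHP_EOL;
}
```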
Running that script should simply print Connected!. This means the PHP script
can connect to ArangoDB and we can go on.
Extracting the data
With a working database connection we can now start with the actual processing.
In place of the TODO in the skeleton file, we can now run an actual export of
the data in the users collection. The following simple function extracts all
documents from the collection and writes them to an output file output.json
in JSON format.
It will also print some statistics about the number of documents and the total data size. The full script can be downloaded here:
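The downloadable script isn't reproduced here; a sketch of such a function, assuming arangodb-php's Export class and its batched cursor interface, might look like this:

```php
<?php
use triagens\ArangoDb\Export;

// export all documents of a collection to output.json, one JSON object
// per line, and report simple statistics
// (assumption: Export / getNextBatch() follow the library's interface)
function export($collectionName, $connection) {
    $fp = fopen('output.json', 'w');

    $export = new Export($connection, $collectionName, array(
        'batchSize' => 5000,   // documents per server round-trip
        '_flat'     => true,   // plain arrays instead of Document objects
        'flush'     => true,   // flush the WAL so all documents are visible
    ));

    $cursor = $export->execute();
    $count  = 0;
    $bytes  = 0;

    while ($docs = $cursor->getNextBatch()) {
        foreach ($docs as $doc) {
            $line = json_encode($doc);
            fwrite($fp, $line . PHP_EOL);
            $bytes += strlen($line) + 1;
            ++$count;
        }
    }
    fclose($fp);

    print "exported $count documents, $bytes bytes" . PHP_EOL;
}

export('users', $connection);
```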
Running this version of the script should print some statistics about the export
and also produce a file named output.json. Each line in the file should be a JSON
object representing a document in the collection.
Applying some transformations
We now use PHP to transform data as we extract it. With an example script, we’ll apply the following transformations on the data:
- rewrite the contents of the gender attribute: female should become f, and male should become m
- rename the attribute birthday to dob
- change the date format in dob and memberSince from YYYY-MM-DD to MM/DD/YYYY
- concatenate the contents of the name.first and name.last subattributes
- transform the array in contact.email into a flat string
- remove all other attributes
Here’s a transformation function that does this, and a slightly simplified export function. This version of the script can also be downloaded here:
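The original transformation function isn't reproduced here; a sketch that implements the rules above, assuming each document arrives as an associative array shaped like the example data, could look like this:

```php
<?php
// convert YYYY-MM-DD to MM/DD/YYYY
function transformDate($date) {
    list($year, $month, $day) = explode('-', $date);
    return $month . '/' . $day . '/' . $year;
}

// map a full document to the reduced, transformed shape
// (assumption: input documents have the attributes described above)
function transform(array $document) {
    return array(
        'name'        => $document['name']['first'] . ' ' . $document['name']['last'],
        'gender'      => ($document['gender'] === 'female' ? 'f' : 'm'),
        'dob'         => transformDate($document['birthday']),
        'memberSince' => transformDate($document['memberSince']),
        'email'       => implode(', ', $document['contact']['email']),
    );
}
```

Each document coming out of the export cursor would then be run through transform() before being written to the output file.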
The adjusted version of the PHP script will now produce an output file named
output-transformed.json.
Filtering attributes
In the last example we discarded a few attributes of each document. Instead of filtering out these attributes with PHP, we can configure the export to already exclude these attributes server-side. This way we can save some traffic.
Here’s an adjusted configuration that will exclude the unneeded attributes _id,
_rev, _key and likes:
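A sketch of the adjusted options, assuming the export API's restrict option takes a type and a list of fields, might look like this:

```php
<?php
use triagens\ArangoDb\Export;

// exclude-type restriction: the listed attributes are stripped server-side
// (assumption: the 'restrict' option shape matches the export API)
$export = new Export($connection, 'users', array(
    'batchSize' => 5000,
    '_flat'     => true,
    'flush'     => true,
    'restrict'  => array(
        'type'   => 'exclude',
        'fields' => array('_id', '_rev', '_key', 'likes'),
    ),
));
```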
The full script that employs the adjusted configuration can be downloaded here.
Instead of excluding specific attributes we can also do it the other way and only
include certain attributes in an export. The following script demonstrates this by
extracting only the _key and name attributes of each document. It then prints the
key/name pairs in CSV format.
The full script can be downloaded here.
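A sketch of that script, under the same assumptions about the Export class as above, could look like this:

```php
<?php
use triagens\ArangoDb\Export;

// include-type restriction: only _key and name are sent by the server
$export = new Export($connection, 'users', array(
    'batchSize' => 5000,
    '_flat'     => true,
    'flush'     => true,
    'restrict'  => array(
        'type'   => 'include',
        'fields' => array('_key', 'name'),
    ),
));

$cursor = $export->execute();
$fp     = fopen('php://stdout', 'w');

while ($docs = $cursor->getNextBatch()) {
    foreach ($docs as $doc) {
        // one CSV row per document: key, first name, last name
        fputcsv($fp, array($doc['_key'], $doc['name']['first'], $doc['name']['last']));
    }
}
fclose($fp);
```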
Using the API without PHP
The export API's REST interface is simple, and it can be used by any client that
speaks HTTP, including curl. The following command fetches the initial 5K documents
from the users collection:
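Assuming the server listens on localhost:8529 and the _system database is used, the initial request could look like this; flush makes sure recently written documents are included, and batchSize caps the size of the first batch:

```shell
curl -X POST "http://localhost:8529/_api/export?collection=users" \
     --data '{"flush": true, "batchSize": 5000}'
```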
The HTTP response will contain a result attribute that contains the actual
documents. It will also contain an attribute hasMore that indicates whether
there are more documents for the client to fetch. If it is set to true, the
HTTP response will also contain an attribute id. The client can use this id
for sending follow-up requests like this (assuming the returned id was
13979338067709):
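A follow-up request for the next batch then simply PUTs to the export cursor's id (same endpoint assumption as before):

```shell
curl -X PUT "http://localhost:8529/_api/export/13979338067709"
```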
That’s about it. Using the export API it should be fairly simple to ship bulk ArangoDB data to client applications or data processing tools.