A few weeks ago I wrote about ArangoDB’s specialized export API.
The export API is useful when the goal is to extract all documents from a given collection and to process them outside of ArangoDB.
The export API provides quick and memory-efficient snapshots of the data in the underlying collection, making it well suited for extracting all of a collection's documents. It can also deliver the data much faster than an AQL query that extracts all documents.
In this post I’ll show how to use the export API to extract data and process it with PHP.
A prerequisite for using the export API is an ArangoDB server with version 2.6
or higher. As there hasn't been an official 2.6 release yet, this currently requires
building the devel branch of ArangoDB from source. Once there is a regular 2.6
release, it should be used instead.
Importing example data
First we need some data in an ArangoDB collection that we can process externally.
For the following examples, I'll use a collection named users, which I'll populate
with 100k example documents. Here's how to get this data into ArangoDB:
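The original snippet isn't reproduced here. A rough equivalent, run from an arangosh prompt connected to the server, might look like the following sketch; the attribute values are made up, but the attribute names match the fields used later in this post:

```js
// run inside arangosh (assumption: default _system database);
// generates 100k synthetic user documents
db._create("users");

for (var i = 0; i < 100000; ++i) {
  db.users.save({
    name: { first: "first" + i, last: "last" + i },
    gender: (i % 2 === 0 ? "female" : "male"),
    birthday: "1984-10-12",
    memberSince: "2015-01-17",
    contact: { email: [ "user" + i + "@example.com" ] },
    likes: [ "running", "reading" ]
  });
}

db.users.count();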
There should now be 100K documents present in a collection named users. You can
quickly verify that by peeking into the collection using the web interface.
Setting up ArangoDB-PHP
An easy way of trying the export API is to use it from PHP. We therefore clone the devel branch of the arangodb-php GitHub repository into a local directory:
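The clone command itself is a one-liner (the exact repository URL is an assumption based on the project name):

```shell
# clone the devel branch of the PHP driver
git clone -b devel https://github.com/arangodb/arangodb-php.git
```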
Note: once there is an official 2.6 release, the 2.6 branch of arangodb-php should
be used instead of the devel branch.
We now write a simple PHP script that establishes a connection to the ArangoDB server running on localhost. We’ll extend that file gradually. Here’s a skeleton file to start with. The code can be downloaded here:
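The skeleton isn't reproduced here; a minimal sketch, assuming arangodb-php's bundled autoloader and its documented connection options, could look like this:

```php
<?php
// connection skeleton (assumption: autoloader path and option names
// follow the arangodb-php library's conventions)
require __DIR__ . '/arangodb-php/autoload.php';

use triagens\ArangoDb\Connection;
use triagens\ArangoDb\ConnectionOptions;
use triagens\ArangoDb\Exception as ArangoException;

try {
    $connection = new Connection(array(
        ConnectionOptions::OPTION_ENDPOINT => 'tcp://localhost:8529',
        ConnectionOptions::OPTION_DATABASE => '_system',
    ));

    // TODO: export and process data here

    print 'Connected!' . PHP_EOL;
} catch (ArangoException $e) {
    print 'Error: ' . $e->getMessage() . PHP_EOL;
}
```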
Running that script should simply print Connected!. This means the PHP script
can connect to ArangoDB and we can go on.
Extracting the data
With a working database connection we can now start with the actual processing.
In place of the TODO in the skeleton file, we can now run an actual export of
the data in the users collection. The following simple function extracts all
documents from the collection and writes them to an output file output.json
in JSON format.
It will also print some statistics about the number of documents and the total data size. The full script can be downloaded here:
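The downloadable script isn't reproduced here; a sketch of such a function, assuming arangodb-php's Export class and its batched cursor interface, might look like this:

```php
<?php
use triagens\ArangoDb\Export;

// export all documents of a collection to output.json, one JSON object
// per line, and report simple statistics
// (assumption: Export / getNextBatch() follow the library's interface)
function export($collectionName, $connection) {
    $fp = fopen('output.json', 'w');

    $export = new Export($connection, $collectionName, array(
        'batchSize' => 5000,   // documents per server round-trip
        '_flat'     => true,   // plain arrays instead of Document objects
        'flush'     => true,   // flush the WAL so all documents are visible
    ));

    $cursor = $export->execute();
    $count  = 0;
    $bytes  = 0;

    while ($docs = $cursor->getNextBatch()) {
        foreach ($docs as $doc) {
            $line = json_encode($doc);
            fwrite($fp, $line . PHP_EOL);
            $bytes += strlen($line) + 1;
            ++$count;
        }
    }
    fclose($fp);

    print "exported $count documents, $bytes bytes" . PHP_EOL;
}

export('users', $connection);
```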
Running this version of the script should print some statistics about the export
and also produce a file named output.json. Each line in the file should be a JSON
object representing a document in the collection.
Applying some transformations
We now use PHP to transform data as we extract it. With an example script, we’ll apply the following transformations on the data:
- rewrite the contents of the gender attribute: female should become f, and male should become m
- rename the attribute birthday to dob
- change the date format in dob and memberSince from YYYY-MM-DD to MM/DD/YYYY
- concatenate the contents of the name.first and name.last subattributes
- transform the array in contact.email into a flat string
- remove all other attributes
Here’s a transformation function that does this, and a slightly simplified export function. This version of the script can also be downloaded here:
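The original transformation function isn't reproduced here; a sketch that implements the rules above, assuming each document arrives as an associative array shaped like the example data, could look like this:

```php
<?php
// convert YYYY-MM-DD to MM/DD/YYYY
function transformDate($date) {
    list($year, $month, $day) = explode('-', $date);
    return $month . '/' . $day . '/' . $year;
}

// map a full document to the reduced, transformed shape
// (assumption: input documents have the attributes described above)
function transform(array $document) {
    return array(
        'name'        => $document['name']['first'] . ' ' . $document['name']['last'],
        'gender'      => ($document['gender'] === 'female' ? 'f' : 'm'),
        'dob'         => transformDate($document['birthday']),
        'memberSince' => transformDate($document['memberSince']),
        'email'       => implode(', ', $document['contact']['email']),
    );
}
```

Each document coming out of the export cursor would then be run through transform() before being written to the output file.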
The adjusted version of the PHP script will now produce an output file named
output-transformed.json.
Filtering attributes
In the last example we discarded a few attributes of each document. Instead of filtering out these attributes with PHP, we can configure the export to already exclude these attributes server-side. This way we can save some traffic.
Here’s an adjusted configuration that will exclude the unneeded attributes _id,
_rev, _key and likes:
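A sketch of the adjusted options, assuming the export API's restrict option takes a type and a list of fields, might look like this:

```php
<?php
use triagens\ArangoDb\Export;

// exclude-type restriction: the listed attributes are stripped server-side
// (assumption: the 'restrict' option shape matches the export API)
$export = new Export($connection, 'users', array(
    'batchSize' => 5000,
    '_flat'     => true,
    'flush'     => true,
    'restrict'  => array(
        'type'   => 'exclude',
        'fields' => array('_id', '_rev', '_key', 'likes'),
    ),
));
```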
The full script that employs the adjusted configuration can be downloaded here.
Instead of excluding specific attributes we can also do it the other way and only
include certain attributes in an export. The following script demonstrates this by
extracting only the _key and name attributes of each document. It then prints the
key/name pairs in CSV format.
The full script can be downloaded here.
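A sketch of that script, under the same assumptions about the Export class as above, could look like this:

```php
<?php
use triagens\ArangoDb\Export;

// include-type restriction: only _key and name are sent by the server
$export = new Export($connection, 'users', array(
    'batchSize' => 5000,
    '_flat'     => true,
    'flush'     => true,
    'restrict'  => array(
        'type'   => 'include',
        'fields' => array('_key', 'name'),
    ),
));

$cursor = $export->execute();
$fp     = fopen('php://stdout', 'w');

while ($docs = $cursor->getNextBatch()) {
    foreach ($docs as $doc) {
        // one CSV row per document: key, first name, last name
        fputcsv($fp, array($doc['_key'], $doc['name']['first'], $doc['name']['last']));
    }
}
fclose($fp);
```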
Using the API without PHP
The export API's REST interface is simple, and it can be used by any client that
speaks HTTP, including curl. The following command fetches the initial 5K documents
from the users collection:
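Assuming the server listens on localhost:8529 and the _system database is used, the initial request could look like this; flush makes sure recently written documents are included, and batchSize caps the size of the first batch:

```shell
curl -X POST "http://localhost:8529/_api/export?collection=users" \
     --data '{"flush": true, "batchSize": 5000}'
```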
The HTTP response will contain a result attribute that contains the actual
documents. It will also contain an attribute hasMore that indicates whether
there are more documents for the client to fetch. If it is set to true, the
HTTP response will also contain an attribute id. The client can use this id
for sending follow-up requests like this (assuming the returned id was
13979338067709):
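A follow-up request for the next batch then simply PUTs to the export cursor's id (same endpoint assumption as before):

```shell
curl -X PUT "http://localhost:8529/_api/export/13979338067709"
```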
That’s about it. Using the export API it should be fairly simple to ship bulk ArangoDB data to client applications or data processing tools.