J@ArangoDB

{ "subject" : "ArangoDB", "tags": [ "multi-model", "nosql", "database" ] }

Parsing PHP Arrays With PHP

By accident I found this StackOverflow question about how to convert a PHP string with array data into an actual PHP array variable.

For example, if your application gets this string from somewhere:

example string data
$string = "array(array('aaa','bbb','ccc','ddd'),array('AAA','BBB','CCC','DDD'))";

How do you convert this into a PHP array variable so you can access the individual array elements? This is what we want to be able to do:

$result = magicallyConvertStringToArray($string);
var_dump($result[0][0]);  // should be 'aaa'

How do we get to our variable?

The obvious solution #1 is to agree on another data exchange format (e.g. JSON) and simply use that. PHP has built-in functions for JSON stringification and JSON parsing.
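
As a minimal sketch of that first option, assuming both sides of the exchange can be changed to speak JSON, the round trip could look like this. The `$data` array simply mirrors the example string from above; the error check is an addition of mine and not part of the original question.

```php
<?php
// Sender side (assumption: the producer of the data can be changed).
$data = array(
  array('aaa', 'bbb', 'ccc', 'ddd'),
  array('AAA', 'BBB', 'CCC', 'DDD')
);
$json = json_encode($data);

// Receiver side: decode into nested PHP arrays (true = arrays instead of objects).
$result = json_decode($json, true);
if ($result === null && json_last_error() !== JSON_ERROR_NONE) {
  throw new Exception("invalid JSON: " . json_last_error_msg());
}

var_dump($result[0][0]);  // string(3) "aaa"
```

No code is executed while decoding, which is the main advantage over evaluating a PHP string.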

Eval?

But what if the data format really has to stay like this and you cannot change it? Then the obvious simple solution would be to eval() the string and capture the result in a new variable.

Voila le array:

using eval
// eval() needs a complete statement, so wrap the expression in "return ...;"
$result = eval("return " . $string . ";");
var_dump($result[0][0]);  // 'aaa'

But everyone knows that eval is evil and should be avoided wherever possible – especially when being run on strings fetched from remote data sources.

Writing a PHP data parser in PHP

Remembering that PHP has a built-in tokenizer for PHP code, we could also make use of this and write a small parser for PHP array data. Note that I wouldn’t recommend writing your own parser if there are other options. But it’s a last resort, and for the task at hand it should be relatively easy.

This is because we’ll only have to deal with arbitrarily nested arrays and a few scalar value types (strings, numbers, booleans, null). We don’t expect to see serialized object instances in our data. And since the built-in tokenizer already splits the input into tokens, we can let it do most of the work.

Before the string can be tokenized, it must be turned into PHP code. This can be achieved by prepending <?php to it (otherwise the tokenizer would treat the whole string as inline HTML and return it as a single T_INLINE_HTML token). We can then use PHP’s token_get_all() function to tokenize the string contents for us.
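
To get a feeling for what the tokenizer returns, here is a small sketch that dumps the tokens for a short input. The sample input and the output format are my own choices for illustration. Tokens come back either as arrays (token id, source text, line number) or, for single characters such as "(" and ",", as plain strings.

```php
<?php
$string = "array('aaa', 42)";

// Prepend an open tag so the tokenizer treats the string as PHP code.
$tokens = token_get_all("<?php " . $string);

foreach ($tokens as $token) {
  if (is_array($token)) {
    // array tokens: [0] = token id, [1] = source text, [2] = line number
    echo token_name($token[0]), " ", var_export($token[1], true), "\n";
  }
  else {
    // single-character tokens are returned as plain strings
    echo "char ", var_export($token, true), "\n";
  }
}
```

The first token is the T_OPEN_TAG for the prepended <?php, followed by T_ARRAY, the opening parenthesis, and a T_CONSTANT_ENCAPSED_STRING whose text still includes the surrounding quotes.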

We can immediately remove all T_WHITESPACE tokens from the list of tokens, because whitespace is irrelevant for our parsing. For easier handling of tokens, we let a class Tokens handle the tokens. This class provides functions for matching, consuming and peeking into tokens:

class for managing the tokens
// class to manage tokens
class Tokens {
  private $tokens;

  public function __construct ($code) {
    // construct PHP code from string and tokenize it
    $tokens = token_get_all("<?php " . $code);
    // kick out whitespace tokens and re-index the remaining ones from 0
    $this->tokens = array_values(array_filter($tokens, function ($token) {
      return (! is_array($token) || $token[0] !== T_WHITESPACE);
    }));
    // remove start token (<?php)
    $this->pop();
  }

  public function done () {
    return count($this->tokens) === 0;
  }

  public function pop () {
    // consume the token and return it
    if ($this->done()) {
      throw new Exception("already at end of tokens!");
    }
    return array_shift($this->tokens);
  }

  public function peek () {
    // return next token, don't consume it
    if ($this->done()) {
      throw new Exception("already at end of tokens!");
    }
    return $this->tokens[0];
  }

  public function doesMatch ($what) {
    $token = $this->peek();

    if (is_string($what) && ! is_array($token) && $token === $what) {
      return true;
    }
    if (is_int($what) && is_array($token) && $token[0] === $what) {
      return true;
    }
    return false;
  }

  public function forceMatch ($what) {
    if (! $this->doesMatch($what)) {
      if (is_int($what)) {
        throw new Exception("unexpected token - expecting " . token_name($what));
      }
      throw new Exception("unexpected token - expecting " . $what);
    }
    // consume the token
    $this->pop();
  }
}

With tokenization done, we need a parser that understands the meaning of the individual tokens and puts them together in a meaningful way. Here’s a parser class that can handle simple PHP arrays, string values, int, double and boolean values plus null:

simple parser class
// parser for simple PHP arrays
class Parser {
  private static $CONSTANTS = array(
    "null" => null,
    "true" => true,
    "false" => false
  );

  private $tokens;

  public function __construct(Tokens $tokens) {
    $this->tokens = $tokens;
  }

  public function parseValue () {
    if ($this->tokens->doesMatch(T_CONSTANT_ENCAPSED_STRING)) {
      // strings
      $token = $this->tokens->pop();
      return stripslashes(substr($token[1], 1, -1));
    }

    if ($this->tokens->doesMatch(T_STRING)) {
      // built-in string literals: null, false, true
      $token = $this->tokens->pop();
      $value = strtolower($token[1]);
      if (array_key_exists($value, self::$CONSTANTS)) {
        return self::$CONSTANTS[$value];
      }
      throw new Exception("unexpected string literal " . $token[1]);
    }

    // the rest...
    // we expect a number here
    $uminus = 1;

    if ($this->tokens->doesMatch("-")) {
      // unary minus
      $this->tokens->forceMatch("-");
      $uminus = -1;
    }

    if ($this->tokens->doesMatch(T_LNUMBER)) {
      // long number
      $value = $this->tokens->pop();
      return $uminus * (int) $value[1];
    }
    if ($this->tokens->doesMatch(T_DNUMBER)) {
      // double number
      $value = $this->tokens->pop();
      return $uminus * (float) $value[1];
    }

    throw new Exception("unexpected value token");
  }

  public function parseArray () {
    $found = 0;
    $result = array();

    $this->tokens->forceMatch(T_ARRAY);
    $this->tokens->forceMatch("(");

    while (true) {
      if ($this->tokens->doesMatch(")")) {
        // reached the end of the array
        $this->tokens->forceMatch(")");
        break;
      }

      if ($found > 0) {
        // we must see a comma following the first element
        $this->tokens->forceMatch(",");
      }

      if ($this->tokens->doesMatch(T_ARRAY)) {
        // nested array
        $result[] = $this->parseArray();
      }
      else if ($this->tokens->doesMatch(T_CONSTANT_ENCAPSED_STRING)) {
        // string
        $string = $this->parseValue();
        if ($this->tokens->doesMatch(T_DOUBLE_ARROW)) {
          // array key (key => value); the value may itself be a nested array
          $this->tokens->pop();
          $result[$string] = $this->tokens->doesMatch(T_ARRAY) ? $this->parseArray() : $this->parseValue();
        }
        else {
          // simple string
          $result[] = $string;
        }
      }
      else {
        $result[] = $this->parseValue();
      }

      ++$found;
    }
    return $result;
  }
}

And finally we need some code to invoke the parser:

parser invocation
// here's our test string (with intentionally wild usage of whitespace)
$string = " array (\"test\" => \"someValue\", 
  array\n('aaa', 'bbb', 'ccc', array('ddd')), 
array('AAA', 'BBB','CCC','DDD', null,1, 2, 3,-4, -42.99, -4e32, true, false))";

$tokens = new Tokens($string);
$parser = new Parser($tokens);
$result = $parser->parseArray();

// check if the parser matched the whole string or if there's something left at the end
if (! $tokens->done()) {
  throw new Exception("still tokens left after parsing");
}

var_dump("RESULT: ", $result);

This will give us the data in a ready-to-use PHP variable $result, with all the nested data structures being built correctly.

A few things to note:

  • Parsing PHP data with PHP is quite easy because PHP already comes with a tokenizer for PHP. Parsing a different language with PHP is considerably harder, as we would have to write a language-specific tokenizer first!

  • The above code was quickly put together for demonstration purposes. I am pretty sure it will not cover all cases. Apart from that, it was written to be intuitive rather than efficient (i.e. instead of modifying the tokens array in place with array_shift(), an efficient version would leave the array untouched and work with an index into it).

  • For grammars more complex than this simple one, don’t go with hand-written parsers but use a parser generator. I am not sure what parser generators are available in the PHP world, but in C and C++ most people will go with GNU Bison and Flex.

  • Writing your own parsers is error-prone even with a parser generator, so don’t do it if you don’t have to. If you can, use a widely supported data format such as JSON instead and let json_decode() do all the heavy lifting for you.
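
The index-based approach mentioned in the notes above could be sketched roughly as follows. The class name IndexedTokens and its layout are hypothetical. It only covers done(), peek() and pop(); the matching helpers from the Tokens class would carry over unchanged.

```php
<?php
// Hypothetical index-based variant of the Tokens class (name is an assumption).
class IndexedTokens {
  private $tokens;
  private $position = 0;

  public function __construct ($code) {
    $tokens = token_get_all("<?php " . $code);
    // drop whitespace tokens and re-index so positions are contiguous
    $this->tokens = array_values(array_filter($tokens, function ($token) {
      return (! is_array($token) || $token[0] !== T_WHITESPACE);
    }));
    // skip the start token instead of shifting it off the array
    $this->position = 1;
  }

  public function done () {
    return $this->position >= count($this->tokens);
  }

  public function peek () {
    // return next token, don't consume it
    if ($this->done()) {
      throw new Exception("already at end of tokens!");
    }
    return $this->tokens[$this->position];
  }

  public function pop () {
    // consume the token by advancing the index; the array is never modified
    $token = $this->peek();
    ++$this->position;
    return $token;
  }
}

$tokens = new IndexedTokens("array(1, 2)");
$first = $tokens->pop();  // the T_ARRAY token
$count = 1;
while (! $tokens->done()) {
  $tokens->pop();
  ++$count;
}
echo token_name($first[0]), " ", $count, "\n";
```

Advancing an integer is a constant-time operation, whereas array_shift() re-indexes the remaining elements on every call. For inputs as small as the ones in this post the difference will hardly be measurable.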