How do you convert this into a PHP array variable so you can access the individual
array elements? This is what we want to be able to do:
    $result = magicallyConvertStringToArray($string);
    var_dump($result[0][0]); // should be 'aaa'
How do we get to our variable?
The obvious solution #1 is to agree on another data exchange format (e.g. JSON)
and simply use that. PHP has built-in functions for JSON stringification
and JSON parsing.
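A minimal sketch of that route, assuming both sides of the exchange can be changed to JSON. The sample data is made up for illustration; json_encode() and json_decode() are the built-in functions meant above.

```php
<?php
// Hypothetical sample data, loosely mirroring the nested arrays in this post.
$data = array(
    array('aaa', 'bbb', 'ccc'),
    array("test" => "someValue"),
);

// sender side: turn the PHP array into a JSON string
$json = json_encode($data);

// receiver side: passing true yields associative arrays instead of objects
$result = json_decode($json, true);

var_dump($result[0][0]); // string(3) "aaa"
```

With the second argument set to true, nested JSON objects come back as PHP arrays, so the individual elements can be accessed with plain array syntax.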
Eval?
But what if the data format really has to stay like this and you cannot change it?
Then the obvious simple solution would be to eval() the string and capture the result
in a new variable.
But eval() executes whatever code the string contains, so it should be avoided wherever possible, especially when
being run on strings fetched from remote data sources.
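For completeness, the eval() route would look roughly like the sketch below. The input string is a hard-coded, hypothetical example; with a string from a remote source the same call would run any code an attacker manages to put into it.

```php
<?php
// Hypothetical, hard-coded input. Never do this with untrusted data.
$string = "array('aaa', array('bbb', 'ccc'))";

// eval() runs the string as PHP code; wrapping it in "return ...;"
// makes eval() hand the evaluated value back to the caller.
$result = eval("return " . $string . ";");

var_dump($result[1][0]); // string(3) "bbb"
```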
Writing a PHP data parser in PHP
Remembering that PHP has a built-in tokenizer for PHP code, we could also make use of
this and write a small parser for PHP array data.
Note that I wouldn’t recommend writing your own parser if there are other options. It’s
a last resort, but for the task at hand it should be relatively easy.
This is because we’ll only have to deal with arbitrarily nested arrays and a few scalar value
types (strings, numbers, booleans, null). We don’t expect to see serialized object instances in our
data. And since the built-in tokenizer takes care of the lexing, we’ll let
it do most of the work.
Before the string can be parsed, it must be turned into PHP code. This can be achieved
by prepending <?php to it (otherwise the tokenizer would treat the whole string as inline
HTML and return a single T_INLINE_HTML token). We can then use PHP’s token_get_all() function to tokenize the string contents for us.
We can immediately remove all T_WHITESPACE tokens from the list of tokens, because whitespace
is irrelevant for our parsing. For easier handling, we wrap the token list in a class Tokens.
This class provides methods for matching, consuming and peeking at tokens:
    // class to manage tokens
    class Tokens
    {
        private $tokens;

        public function __construct($code)
        {
            // construct PHP code from string and tokenize it
            $tokens = token_get_all("<?php " . $code);

            // kick out whitespace tokens
            $this->tokens = array_filter($tokens, function ($token) {
                return (!is_array($token) || $token[0] !== T_WHITESPACE);
            });

            // remove start token (<?php)
            $this->pop();
        }

        public function done()
        {
            return count($this->tokens) === 0;
        }

        public function pop()
        {
            // consume the token and return it
            if ($this->done()) {
                throw new Exception("already at end of tokens!");
            }
            return array_shift($this->tokens);
        }

        public function peek()
        {
            // return next token, don't consume it
            if ($this->done()) {
                throw new Exception("already at end of tokens!");
            }
            return $this->tokens[0];
        }

        public function doesMatch($what)
        {
            $token = $this->peek();

            if (is_string($what) && !is_array($token) && $token === $what) {
                return true;
            }
            if (is_int($what) && is_array($token) && $token[0] === $what) {
                return true;
            }
            return false;
        }

        public function forceMatch($what)
        {
            if (!$this->doesMatch($what)) {
                if (is_int($what)) {
                    throw new Exception("unexpected token - expecting " . token_name($what));
                }
                throw new Exception("unexpected token - expecting " . $what);
            }
            // consume the token
            $this->pop();
        }
    }
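To make the tokenizer's output more tangible, here is a small sketch that lists the tokens for a tiny snippet, once without and once with the <?php prefix. The helper tokenNames() and the sample snippet are made up for this illustration; token_get_all() and token_name() are the built-in functions used above.

```php
<?php
// Hypothetical helper: map every token to a readable name.
// Array tokens have the form [token id, text, line number];
// single-character tokens such as "(" or "," are plain strings.
function tokenNames($source)
{
    $names = array();
    foreach (token_get_all($source) as $token) {
        $names[] = is_array($token) ? token_name($token[0]) : $token;
    }
    return $names;
}

// without the prefix, everything is one chunk of inline HTML
echo implode(" ", tokenNames("array('a', -1)")), "\n";
// T_INLINE_HTML

// with the prefix, the tokenizer sees PHP code
echo implode(" ", tokenNames("<?php array('a', -1)")), "\n";
// T_OPEN_TAG T_ARRAY ( T_CONSTANT_ENCAPSED_STRING , T_WHITESPACE - T_LNUMBER )
```

Note that the minus sign arrives as a separate "-" token in front of the number, so a parser has to handle negative numbers itself.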
With all the tokenization being done, we need a parser that understands the meaning
of the individual tokens and puts them together in a meaningful way. Here’s a parser
class that can handle simple PHP arrays, string values, int, double and boolean values
plus null:
    // parser for simple PHP arrays
    class Parser
    {
        private static $CONSTANTS = array(
            "null"  => null,
            "true"  => true,
            "false" => false
        );

        private $tokens;

        public function __construct(Tokens $tokens)
        {
            $this->tokens = $tokens;
        }

        public function parseValue()
        {
            if ($this->tokens->doesMatch(T_CONSTANT_ENCAPSED_STRING)) {
                // strings: strip the surrounding quotes and unescape backslashes
                $token = $this->tokens->pop();
                return stripslashes(substr($token[1], 1, -1));
            }

            if ($this->tokens->doesMatch(T_STRING)) {
                // built-in string literals: null, false, true
                $token = $this->tokens->pop();
                $value = strtolower($token[1]);
                if (array_key_exists($value, self::$CONSTANTS)) {
                    return self::$CONSTANTS[$value];
                }
                throw new Exception("unexpected string literal " . $token[1]);
            }

            // the rest...
            // we expect a number here
            $uminus = 1;

            if ($this->tokens->doesMatch("-")) {
                // unary minus
                $this->tokens->forceMatch("-");
                $uminus = -1;
            }

            if ($this->tokens->doesMatch(T_LNUMBER)) {
                // long number
                $value = $this->tokens->pop();
                return $uminus * (int) $value[1];
            }
            if ($this->tokens->doesMatch(T_DNUMBER)) {
                // double number
                $value = $this->tokens->pop();
                return $uminus * (float) $value[1];
            }

            throw new Exception("unexpected value token");
        }

        public function parseArray()
        {
            $found = 0;
            $result = array();

            $this->tokens->forceMatch(T_ARRAY);
            $this->tokens->forceMatch("(");

            while (true) {
                if ($this->tokens->doesMatch(")")) {
                    // reached the end of the array
                    $this->tokens->forceMatch(")");
                    break;
                }

                if ($found > 0) {
                    // we must see a comma following the first element
                    $this->tokens->forceMatch(",");
                }

                if ($this->tokens->doesMatch(T_ARRAY)) {
                    // nested array
                    $result[] = $this->parseArray();
                } elseif ($this->tokens->doesMatch(T_CONSTANT_ENCAPSED_STRING)) {
                    // string
                    $string = $this->parseValue();
                    if ($this->tokens->doesMatch(T_DOUBLE_ARROW)) {
                        // array key (key => value)
                        $this->tokens->pop();
                        // the value may itself be a nested array
                        $result[$string] = $this->tokens->doesMatch(T_ARRAY)
                            ? $this->parseArray()
                            : $this->parseValue();
                    } else {
                        // simple string
                        $result[] = $string;
                    }
                } else {
                    $result[] = $this->parseValue();
                }

                ++$found;
            }

            return $result;
        }
    }
And finally we need some code to invoke the parser:
    // here's our test string (with intentionally wild usage of whitespace)
    $string = " array (\"test\" => \"someValue\", array\n('aaa', 'bbb', 'ccc', array('ddd')), array('AAA', 'BBB','CCC','DDD', null,1, 2, 3,-4, -42.99, -4e32, true, false))";

    $tokens = new Tokens($string);
    $parser = new Parser($tokens);
    $result = $parser->parseArray();

    // check if the parser matched the whole string or if there's something left at the end
    if (!$tokens->done()) {
        throw new Exception("still tokens left after parsing");
    }

    var_dump("RESULT: ", $result);
This will give us the data in a ready-to-use PHP variable $result, with all the
nested data structures being built correctly.
A few things to note:
Parsing PHP data with PHP is quite easy because PHP already comes with a tokenizer
for PHP. Parsing a different language with PHP is considerably harder, as we would have to
write a language-specific tokenizer first!
The above code was quickly put together for demonstration purposes. I am pretty sure
it will not cover all cases. Apart from that, it was written to be intuitive rather than
efficient (i.e. instead of modifying the tokens array in place with array_shift(),
an efficient version would leave that array untouched and work with an index into it).
For grammars more complex than this simple one, don’t go with hand-written parsers
but use a parser generator. I am not sure what parser generators are available in the
PHP world, but in C and C++ most people will go with GNU Bison
and Flex.
Writing your own parsers is error-prone even with a parser generator, so don’t do
it if you don’t have to. If you can, use a widely supported data format such as JSON
instead and let json_decode() do all the heavy lifting for you.
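The index-based alternative mentioned in the notes above could look roughly like the following sketch. The class name TokenCursor is made up for this illustration, and only the done(), peek() and pop() methods are shown; the matching methods from the Tokens class would carry over unchanged.

```php
<?php
// Hypothetical variant of the token holder: the token list is never
// modified, a cursor ($pos) moves over it instead of array_shift().
class TokenCursor
{
    private $tokens;
    private $pos;

    public function __construct($code)
    {
        $tokens = token_get_all("<?php " . $code);
        // drop whitespace; array_values() restores consecutive keys 0..n-1
        $this->tokens = array_values(array_filter($tokens, function ($token) {
            return !is_array($token) || $token[0] !== T_WHITESPACE;
        }));
        // start behind the opening tag, which is always the first token
        $this->pos = 1;
    }

    public function done()
    {
        return $this->pos >= count($this->tokens);
    }

    public function peek()
    {
        // return next token, don't consume it
        if ($this->done()) {
            throw new Exception("already at end of tokens!");
        }
        return $this->tokens[$this->pos];
    }

    public function pop()
    {
        // consume the token by advancing the cursor
        $token = $this->peek();
        ++$this->pos;
        return $token;
    }
}

// usage: walk over all tokens of a small sample string
$cursor = new TokenCursor("array('aaa')");
$seen = array();
while (!$cursor->done()) {
    $token = $cursor->pop();
    $seen[] = is_array($token) ? $token[1] : $token;
}
echo implode(" ", $seen), "\n"; // array ( 'aaa' )
```

array_shift() re-indexes the array on every call, whereas advancing an integer is a constant-time operation, which is what makes this variant cheaper for long inputs.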