{ "subject" : "ArangoDB", "tags": [ "multi-model", "nosql", "database" ] }

Setting Up Test Data

Today I was asked to look at code that was supposed to read data from a MySQL data source, process it and then import it into ArangoDB.

To run and debug the code I needed a MySQL data source, so I thought I’d quickly set up a simple example table with a few rows. It turned out that this took more time than I had expected.

Maybe I’m spoilt by JavaScript-enabled, schema-free databases where creating such test setups is so much easier.

I worked with MySQL databases in production for more than 10 years and spent much of that time in the mysql client. I always liked MySQL, but in the past few years I drifted away from it and lost touch. Instead, I got sucked into the NoSQL landscape, and I enjoy it very much.

Getting back to the original problem: I needed some MySQL table with a few thousand rows for a test. It turned out I didn’t even have MySQL installed on my computer, so I needed to install it first.

After setting up the MySQL server, I created a table examples for storing my test data:

USE test;
CREATE TABLE examples (attribute1 VARCHAR(20), attribute2 VARCHAR(20));

Not really the black belt of schema design, but good enough for a quick test.

Now the table needed some rows. 100,000 rows should be enough. I wrote a small bash script to create them, as there is no sane way to do this with the MySQL client alone:

for i in `seq 1 100000`; do
    echo "INSERT INTO examples VALUES (\"test$i\", \"test$i\");" >> import.sql
done

Time to import the data!

mysql -u user test < import.sql

At first I was a bit surprised that this command did not return instantly. I let it run for about a minute and then checked the import progress with a second mysql client. It turned out that only very few records had been imported so far, and the import was crawling along at only around 30-35 records per second.

It seems I had forgotten that I was working with a No-NoSQL database, with full ACID semantics for everything. My import file contained 100,000 individual INSERT statements, so I was effectively asking for 100,000 transactions and 100,000 fsync operations. That import would have taken forever on my slow HDD!

I quickly changed an InnoDB setting so that the transaction log is flushed to disk only about once per second instead of at every commit:

mysql> SET GLOBAL innodb_flush_log_at_trx_commit = 2;
Query OK, 0 rows affected (0.00 sec)

Now the import finished in 7 seconds.

I finally had the data in MySQL, but overall it took me about 10 minutes to get it done. Probably a bit less if I were still an active MySQL user and had remembered the default commit behavior right from the start.

Still, my feeling is that it takes too much time to get something so simple done.

I don’t blame the database for trying to commit all 100,000 single-row INSERT operations and fsync them to disk. It simply cannot know whether the data are important or just throw-away test records.

But there are other reasons: I had to write a bash script to produce the test data, as there is no sane way to do this with the MySQL client alone. Writing bash scripts is fine, and in general I like doing it, but I don’t want to have to do it for a dead-simple test setup.

And by the way, what if it turned out that I needed to generate slightly more complex test data? In the MySQL case I probably would have resorted to sed or awk, or would have thrown away my bash script and rewritten it in some other language. Either way, I would have wasted even more time.

I personally prefer the ability to use a scripting language for such tasks. JavaScript is ubiquitous these days, and I want to be able to use it in a database’s command-line client.

For example, here’s how the test setup looks in the ArangoShell:

for (i = 0; i < 100000; ++i) {
  db.examples.save({ attribute1: "test" + i, attribute2: "test" + i });
}

I find this much easier to use: it lets me do everything in one place, removing the need to write a separate script that prepares a data file or an SQL file first.

As a bonus, using a programming language is much more flexible and powerful. If I needed to generate slightly more complex test data, I could simply adjust the JavaScript code and re-run it.
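To illustrate the point, here is a sketch of what such an adjustment might look like. The makeDoc helper and the extra fields are purely my own invention for this example; the generation logic is plain JavaScript, and in the ArangoShell each generated document would simply be passed to db.examples.save:

```javascript
// Sketch: generating slightly more complex test documents in JavaScript.
// makeDoc is a hypothetical helper, not part of any database API.
function makeDoc(i) {
  return {
    attribute1: "test" + i,
    attribute2: Math.floor(Math.random() * 1000), // random integer 0..999
    createdAt: new Date().toISOString()           // extra field, no ALTER TABLE needed
  };
}

for (let i = 0; i < 100000; ++i) {
  const doc = makeDoc(i);
  // db.examples.save(doc);  // ArangoShell call; commented out so the sketch runs standalone
}
```

Adding or removing a field here is a one-line change, with no schema to update first.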

Even more annoying to me is that I had to provide a schema for the table first. I could have got away with declaring all text fields as VARCHAR(255) or TEXT, so I could at least ignore string length restrictions. But I still had to type in the table schema once, even though it feels completely useless for this particular use case.

It would get even more annoying if during my test I noticed I needed more or other columns. Then I would need to adjust the table schema using ALTER TABLE or adjust the CREATE TABLE statement and run it again, keeping me away from the original task.

Maybe using schema-free databases for too long has spoilt me, but I much prefer starting quickly and without a schema. I know the data I am going to load will have a structure and will be somewhat self-describing, so the database can still figure out what the individual parts of a record are.

On a side-note: should you be a fan of using query languages, the same test setup can also be achieved by running the following AQL query from the ArangoShell:

db._query("FOR i IN 1..100000 LET value = CONCAT('test', i) " +
          "INSERT { attribute1: value, attribute2: value } INTO examples");