Today I was asked to look at code that was supposed to read data from a MySQL data source, process it and then import it into ArangoDB.
To run and debug the code I needed a MySQL data source, so I thought I'd quickly set up a simple example table with a few rows. It turned out that this took more time than I had expected.
I worked with MySQL databases in production for 10+ years and spent much time working with the mysql client. I always liked MySQL, but in the past few years I was driven away from it and lost contact. Instead, I got sucked into the NoSQL landscape and am enjoying it quite a bit.
Getting back to the original problem: I needed some MySQL table with a few thousand rows for a test. It turned out I didn’t even have MySQL installed on my computer, so I needed to install it first.
After setting up the MySQL server, I created a table for storing my test data.
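The original snippet was not preserved in this copy; a minimal sketch of what it may have looked like, with the table and column names (`testtable`, `value1`, `value2`) being my assumptions:

```sql
-- hypothetical names; the original statement was lost
CREATE TABLE testtable (
  id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  value1 VARCHAR(255) NOT NULL,
  value2 INT NOT NULL
) ENGINE=InnoDB;
```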
Not really the black belt of schema design, but good enough for a quick test.
Now the table needed some rows; 100,000 should be enough. As there is no sane way to generate them with the mysql client alone, I wrote a small bash script to create them.
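The script itself was lost in this copy; it was presumably something along these lines, with the output filename and the table and column names being my assumptions:

```shell
#!/bin/bash
# write one single-row INSERT statement per record;
# table and column names here are hypothetical
for i in $(seq 1 100000); do
  echo "INSERT INTO testtable (value1, value2) VALUES ('test$i', $i);"
done > data.sql
```

The resulting file could then be fed to the server with something like `mysql test < data.sql` (database name assumed).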
Time to import the data!
At first I was a bit surprised this command did not return instantly. I let it run for about a minute, and then began checking the import progress with a second mysql client. It turned out only very few records had been imported, and the import script continued to create only around 30-35 records per second.
Seems I had forgotten that I was working with a No-NoSQL database, with full
ACID semantics for everything. My import file contained 100,000 INSERT
statements, so I was effectively asking for 100,000 transactions and 100,000 fsync operations.
That import would have taken forever with my slow HDD!
I quickly changed the InnoDB setting to make it commit only about once per second:
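The concrete statement is missing from this copy; the usual knob for this is InnoDB's log-flush setting, so it was presumably something like:

```sql
-- 1 (the default): flush the InnoDB log to disk at every transaction commit
-- 2: write the log at each commit, but flush it to disk only about once per second
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
```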
Now the import finished in 7 seconds.
I finally got the data into MySQL, but overall it took me about 10 minutes to get it done. It would probably have taken a bit less if I were still an active MySQL user and had remembered the default behavior right from the start.
Still, my feeling is that it takes too much time to get something so simple done.
I don’t blame the database for trying to commit all 100,000 single-row
INSERT operations and fsync them to disk. It simply cannot know if the data
are important or just throw-away test records.
But there are other reasons: I had to write a bash script to produce the test data, as the MySQL client offers no sane way to do this alone. Writing bash scripts is fine, and in general I like doing it, but I don't want to have to for a dead-simple test setup.
And by the way, what if it turned out that I needed to generate slightly more complex test data? In the MySQL case I probably would have resorted to sed or awk, or thrown away my bash script and rewritten it in some other language, wasting even more time.
For example, here is how the same test setup looks in the ArangoShell.
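The snippet did not survive in this copy; a sketch of what an equivalent ArangoShell setup could look like, with the collection and attribute names being my assumptions:

```js
// create a collection and insert 100,000 test documents in one go;
// no schema declaration needed (names here are hypothetical)
db._create("testtable");
for (var i = 1; i <= 100000; i++) {
  db.testtable.save({ value1: "test" + i, value2: i });
}
```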
I find this much easier to use: it lets me do everything in one place, removing the need to first write a separate script that prepares a data file or an SQL file.
Even more annoying to me is that I had to provide a schema for the
table first. I could have gotten away with declaring all text fields as
TEXT so I could at least ignore string
length restrictions. But I still had to type in the table schema
once, even though it feels completely useless for this particular use case.
It would get even more annoying if during my test I noticed that I needed more
or different columns. Then I would have to adjust the table schema using
ALTER TABLE, or adjust the
CREATE TABLE statement and run it again, keeping me
away from the original task.
Maybe using schema-free databases for too long has spoilt me, but I much prefer starting quickly and without a schema. I know the data I am going to load will have a structure and will be somewhat self-describing, so the database can still figure out what the individual parts of a record are.
On a side-note: should you be a fan of using query languages, the same test setup can also be achieved by running the following AQL query from the ArangoShell:
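The query itself is missing from this copy; a sketch of what it could look like, with the collection and attribute names being my assumptions:

```
// generate and insert 100,000 test documents (names here are hypothetical)
FOR i IN 1..100000
  INSERT { value1: CONCAT("test", i), value2: i } INTO testtable
```

In the ArangoShell this would be run with something like `db._query(...)`, passing the query as a string.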