J@ArangoDB

{ "subject" : "ArangoDB", "tags": [ "multi-model", "nosql", "database" ] }

Analyzing Git Commits With ArangoDB

I often find myself searching for certain commits using git log and friends. While I really love the power and flexibility that come with the git and other Unix command-line tools, sometimes it can be more convenient to use a database to filter and aggregate commit data.

I gave it a quick try yesterday and imported the commit history of ArangoDB’s Git repository into ArangoDB and ran some queries on the data. While the query results for our repository may not be interesting for everyone, I think it is still worth sharing what I did. Even though I didn’t try it, I think the overall procedure is applicable with any other Git repository.

Converting the Git history to JSON

The way to extract history and commit data from a local repository is to use git log. Though its output is greatly customizable, it does not provide an out-of-the-box solution for producing JSON. So I wrote a simple wrapper script (in PHP) around it. The script can be found here.

Here’s how to run it:

converting the git history to JSON
1
2
3
cd path/to/local/git-repository
wget https://gist.githubusercontent.com/jsteemann/65ef221646449713b2c5/raw/fef22c729e01dd0777b378ac1e17e951ea47c7dd/git-log-to-json.php
php git-log-to-json.php > arango-commits-master-201503.json

The script may run a few minutes on bigger repositories such as ours. In the end, it should produce a JSON file named arango-commits-master-201503.json.

I have also uploaded the JSON file here. Note that the file only contains commits from the master branch and not all commits done in ArangoDB in total.

Importing the commits into ArangoDB

The simplest way to get the commits into ArangoDB is to use arangoimp:

importing the commits into ArangoDB
1
2
3
4
5
6
arangoimp                                   \
  --collection commits                      \
  --create-collection true                  \
  --file arango-commits-master-201503.json  \
  --overwrite true                          \
  --batch-size 32000000

This should have imported all the commits into a collection commits in the default database.

Querying the commit logs

Following are a few example queries that I ran on the data from the ArangoShell. As mentioned before, it should be possible to run the queries for other repositories’ data.

getting all contributors
1
2
query = 'FOR commit IN commits COLLECT author = commit.author.name RETURN author';
db._query(query).toArray();
retrieving the total number of commits
1
2
query = 'FOR commit IN commits COLLECT WITH COUNT INTO count RETURN count';
db._query(query).toArray();
retrieving the number of commits by contributor
1
2
query = 'FOR commit IN commits COLLECT author = commit.author.name WITH COUNT INTO count RETURN { author: author, count: count }';
db._query(query).toArray();
retrieving the tagged commits
1
2
query = 'FOR commit IN commits FILTER commit.tag != null SORT commit.date RETURN KEEP(commit, [ "date", "message", "tag" ])';
db._query(query).toArray();
retrieving number of commits per year
1
2
query = 'FOR commit IN commits COLLECT year = DATE_YEAR(commit.date) WITH COUNT INTO count RETURN { year: year, count: count }';
db._query(query).toArray();
retrieving number of commits per month / year
1
2
query = 'FOR commit IN commits COLLECT year = DATE_YEAR(commit.date), month = DATE_MONTH(commit.date)  WITH COUNT INTO count RETURN { month: CONCAT(year, "/", month), count: count }';
db._query(query).toArray();
retrieving number of commits per weekday
1
2
query = 'FOR commit IN commits COLLECT day = DATE_DAYOFWEEK(commit.date) WITH COUNT INTO count RETURN { day: TRANSLATE(day, { "0": "Sunday", "1": "Monday", "2": "Tuesday", "3": "Wednesday", "4": "Thursday", "5": "Friday", "6": "Saturday" }), count: count }';
db._query(query).toArray();
retrieving commits with string “issue #” in commit message
1
2
query = 'FOR commit IN commits FILTER LIKE(commit.message, "%issue #%") SORT commit.date DESC LIMIT 10 RETURN UNSET(commit, "files")';
db._query(query).toArray();
retrieving number of commits related to Foxx
1
2
query = 'FOR commit IN commits FILTER LIKE(LOWER(commit.message), "%foxx%") COLLECT year = DATE_YEAR(commit.date), month = DATE_MONTH(commit.date) WITH COUNT INTO count RETURN { month: CONCAT(year, "/", month), count: count }';
db._query(query).toArray();
retrieving commits that touched the most files
1
2
query = 'FOR commit IN commits LET count = LENGTH(commit.files || []) SORT count DESC LIMIT 10 RETURN MERGE(UNSET(commit, "files"), { files: count })';
db._query(query).toArray();
retrieving files modified most often
1
2
query = 'FOR commit IN commits FOR filename IN commit.files || [] COLLECT file = filename WITH COUNT INTO count SORT count DESC LIMIT 10 RETURN { file: file, count: count }';
db._query(query).toArray();

Enjoy!