Groovy Fun with Git - Part 1 of 3
Pro Git, Scott Chacon's great book on Git, has a chapter on Git internals that is a must-read if you want to look under the hood and see how Git does things. There are many blog posts, slide presentations, and video tutorials on this topic. I find Chacon's to be the shortest and the clearest.
In Pro Git's chapter 10: Git Internals, Scott uses a very simple example to show how you can run small experiments to explore the internals of Git. He shows how Git constructs its on-disk data structures (in the .git/objects directory) as you give it commands, and how you can use Git's "plumbing" commands to examine the data structure as it grows.
Reading along, I found out how brilliant Git's internal design is. It can be inspected with ordinary text tools; you don't need specialized GUIs or hex dump tools (compare with the nightmare from Microsoft known as the registry). Scott shows how Git is essentially a key-value store with a few simple constraints, modeled after the Unix filesystem design. The brilliant part of the design is its amazing simplicity. The data structure basically consists of two node types: a "blob" node that points at a zlib-compressed blob, and a "tree" node that holds a list of pointers to other "blob" and "tree" nodes. This allows a recursive representation of object trees (the revision tree). In addition, Git has "commit" nodes, each of which references a revision tree and holds the commit metadata. The commit nodes form a linked list: each commit points at its immediately preceding commit, its parent (a merge commit has more than one parent). The ultimate parent is the first commit, and later commits are children of previous commits. A branch is simply a reference to a commit node. That's it! Git's internal design can be described in one paragraph. Please see Chapter 10 of Scott's book for the details. Of course, learning Git's internals also requires some exploration of the commands and their effect on the internal structures.
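As a small illustration of the key-value idea, the following Python sketch stores a blob the way Git's loose-object format does and reads it back by its key. It is only a sketch under stated assumptions: it assumes the default SHA-1 object format, the function names store_blob and load_object are made up for this illustration, and it writes to a scratch directory rather than to a real repository.

```python
import hashlib
import os
import tempfile
import zlib
from typing import Tuple


def store_blob(objects_dir: str, content: bytes) -> str:
    """Write content as a loose blob object and return its 40-character ID."""
    # Git hashes a short header ("blob <size>\0") followed by the raw content.
    payload = b"blob " + str(len(content)).encode("ascii") + b"\x00" + content
    object_id = hashlib.sha1(payload).hexdigest()
    # The first two hex characters name the directory, the other 38 the file.
    directory = os.path.join(objects_dir, object_id[:2])
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, object_id[2:]), "wb") as handle:
        handle.write(zlib.compress(payload))
    return object_id


def load_object(objects_dir: str, object_id: str) -> Tuple[str, bytes]:
    """Read a loose object back and return (type, content)."""
    path = os.path.join(objects_dir, object_id[:2], object_id[2:])
    with open(path, "rb") as handle:
        payload = zlib.decompress(handle.read())
    # Split the "<type> <size>" header from the content at the first NUL byte.
    header, _, content = payload.partition(b"\x00")
    object_type, _size = header.decode("ascii").split(" ")
    return object_type, content


if __name__ == "__main__":
    # Hypothetical scratch directory standing in for .git/objects.
    with tempfile.TemporaryDirectory() as objects_dir:
        key = store_blob(objects_dir, b"test content\n")
        print(key, load_object(objects_dir, key))
```

The key is derived entirely from the content, so storing the same bytes twice yields the same object. Tree and commit objects use the same header-plus-payload layout with a different type name.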
You can explore Git's object database as you do a Unix filesystem. You can also look at Git as a NoSQL database with a high-level query language (porcelain commands) and a low-level language (plumbing commands). The internals of the data structures are a masterpiece of simple, elegant design that is not hidden from view, but available and explorable via OS tools and low-level commands. I wish the major databases (MySQL, MS SQL Server, etc.) provided that kind of visibility into their internals, at that level of accessibility. Git is a tremendous demonstration of Linus Torvalds's design genius.
As I followed along, and later while doing my own simple exploratory experiments with Git, I found that I was repeatedly using a few Git plumbing commands to examine what the object database looked like. Essentially, you need "git cat-file -p" to print out the content of a Git object. Scott uses a few other plumbing commands to create objects in Git. I decided to refrain from using plumbing commands to make things happen, since, in practice, this is a dangerous habit to get into: you can easily corrupt your Git repository with a single command. Plumbing commands that are read-only and only help examine the database should be fine.
In most of my experiments, I needed to see a snapshot of the data structure, similar to what you see with "tree .git/objects" but showing more information. I conduct my experiments with three vertical panes in Gnome Terminator. In the leftmost pane, I type commands. In the center pane, I show the output of "tree .git/objects". In the rightmost pane, I show a view similar to "tree .git/objects" but with information obtained from "git cat-file -p", or from a shell script that calls it. See the screenshot below.
The "git cat-file" plumbing command shows only one object. I needed to see all the objects as the repository grew. So I wrote a simple dump tool: a small Bash script that loops through the .git/objects directory, constructs each object's name (the SHA-1) by concatenating the two-character directory name with the 38-character file name, and calls "git cat-file -p <object name>". The output from the Bash script can be massaged with the usual Unix text tools (grep, sed, awk, sort, cut, etc.) to filter, sort, and format.
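The loop described above can be sketched in a few lines. The following Python version is an illustration of the idea, not the original Bash script. The function names loose_object_ids, cat_file, and dump are hypothetical, the sketch assumes that git is on the PATH, and it only sees loose objects (anything already moved into a packfile is skipped).

```python
import os
import subprocess
from typing import List

HEX_DIGITS = set("0123456789abcdef")


def loose_object_ids(objects_dir: str) -> List[str]:
    """Return the sorted IDs of all loose objects found under objects_dir."""
    ids = []
    for fan_out in sorted(os.listdir(objects_dir)):
        directory = os.path.join(objects_dir, fan_out)
        # Object directories have two hex characters; this skips "info" and "pack".
        if len(fan_out) != 2 or not set(fan_out) <= HEX_DIGITS:
            continue
        if not os.path.isdir(directory):
            continue
        for name in sorted(os.listdir(directory)):
            # Directory name (2 chars) + file name (38 chars) = 40-char object name.
            if len(name) == 38:
                ids.append(fan_out + name)
    return ids


def cat_file(repo_dir: str, object_id: str) -> str:
    """Return the output of `git cat-file -p <object_id>` run inside repo_dir."""
    result = subprocess.run(
        ["git", "cat-file", "-p", object_id],
        cwd=repo_dir,
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # blobs may hold bytes that are not valid UTF-8
        check=True,        # raise CalledProcessError if git reports a failure
    )
    return result.stdout


def dump(repo_dir: str) -> None:
    """Print every loose object in the repository at repo_dir."""
    objects_dir = os.path.join(repo_dir, ".git", "objects")
    for object_id in loose_object_ids(objects_dir):
        print("==", object_id)
        print(cat_file(repo_dir, object_id))
```

Calling dump(".") from the top of a working tree prints each object in turn. Because cat_file only reads, it stays within the read-only plumbing commands.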
As I did more and more experiments to explore different aspects and commands of Git, my little Bash script proved pretty hard to extend, as I wanted more and more features.
I considered replacing Bash with a higher-level scripting language, provided I could make the switch quickly, say in a few hours. There is a lot of good material about using Python as a sysadmin language. For me, that would be a major diversion: I know some Python, but not enough to work fast. I decided to try Groovy (I am mainly a Java developer) as a Bash replacement. I was very pleasantly surprised by how easy that proved to be. It took less than two hours to translate my Bash script to a Groovy script. I just needed to Google how to call an OS command from Groovy and how to check its exit status and output. The rest was regular Groovy. I found that approach, Groovy for shell scripting, to be elegant, fast enough, and perfect for my purposes. So far my only purpose is to have a tool that allows me to explore the Git object database.
In subsequent posts, I will describe how to use my Groovy tool (gitobjects) to support probing experiments using Git. It really made this project fun for me. I will also go over the Groovy code.