Have you ever wondered how git works? Have you tried to figure out how git stores stuff?
Well, I have. This is what I discovered about the git object store.
In the beginning there was nothing.
Not really. In fact, there’s quite a bit of stuff in an empty repo. On your command line, create a new repo and list its contents.
As you can see, there’s quite a bit of stuff there to begin with.
Something we learn right off the bat is that Git supports hooks. Take a look at some of the samples. At work, we use a pre-commit hook to run our linting before letting you commit. But I digress; I want to talk about the object store.
You’ll see that initially, the objects directory is empty, apart from its info and pack subdirectories.
So, let’s create a git object. Write the text “git rocks” to a file named test.txt and stage it with git add. In order for this to work, the file’s content has to match mine exactly.
As a result of adding the above object, the objects directory should now contain a new file: .git/objects/f6/8ce0a31a54e37649ee417d60e90911258f1043.
And then there was a (SHA1) hash
You might now be wondering how I was so confident you’d get the same output I got on my machine.
The answer is simple: Git is, at its core, just a key-value store. Git doesn’t care what you call your objects; Git only cares about the content of those objects.
To store an object, Git first performs a few operations on the data. One of these is to calculate the SHA1 hash of the data, prefixed with a short header that records the object’s type and size. Git then stores the compressed data in the object store under a filename derived from that hash.
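To make the hashing step concrete, here is a minimal Python sketch of how a blob’s ID can be derived. It assumes Git’s standard loose-object format, where the hashed bytes are a header of the form "blob <size>", a NUL byte, and then the raw content. The function names blob_id and object_path are made up for this illustration; they are not part of Git.

```python
import hashlib


def blob_id(data: bytes) -> str:
    """Return the 40-character hex SHA1 that Git would assign to a blob."""
    # Git hashes a small header ("blob <size>" plus a NUL byte) followed by the content.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def object_path(object_id: str) -> str:
    """Map an object ID to its loose-object path inside the .git directory."""
    # The first two hex characters name the directory, the remaining 38 the file.
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


if __name__ == "__main__":
    oid = blob_id(b"hello world\n")
    print(oid)
    print(object_path(oid))
```

For an empty input, blob_id returns e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, which is the ID Git reports for an empty file. Note that the filename never enters the calculation; only the content does.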
At this point we have to make a brief but important aside. SHA1 has two properties that make it really useful for Git’s use:
- It’s extremely unlikely that 2 different objects will have the same hash value.
- Identical objects will always have the same hash representation.
And so, the directory name f6 followed by the filename 8ce0a31a54e37649ee417d60e90911258f1043 is the SHA1 hash Git computed for “git rocks”. That’s how I knew you’d get exactly the same output I got.
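The key-value idea can be sketched as a toy content-addressed store in Python. This is only an illustration under a few assumptions: it mimics Git’s loose-object layout (header plus content, zlib-compressed, stored under a path split after the first two hex characters), and the put and get helpers are hypothetical names, not Git commands or APIs.

```python
import hashlib
import os
import tempfile
import zlib


def put(store_dir: str, data: bytes) -> str:
    """Store data as a blob and return its key (the hex SHA1)."""
    payload = b"blob " + str(len(data)).encode("ascii") + b"\0" + data
    key = hashlib.sha1(payload).hexdigest()
    path = os.path.join(store_dir, key[:2], key[2:])
    if not os.path.exists(path):  # identical content is written only once
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(zlib.compress(payload))
    return key


def get(store_dir: str, key: str) -> bytes:
    """Look up a key and return the original content."""
    path = os.path.join(store_dir, key[:2], key[2:])
    with open(path, "rb") as f:
        payload = zlib.decompress(f.read())
    # Drop the "blob <size>" header; everything after the first NUL is content.
    _header, _, data = payload.partition(b"\0")
    return data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as store:
        first = put(store, b"git rocks\n")
        second = put(store, b"git rocks\n")  # same content, same key
        other = put(store, b"git rules\n")   # different content, different key
        print(first == second, first != other)
        print(get(store, first))
```

Because the key is derived from the content alone, storing the same bytes twice lands on the same path, which is why everyone who stores the same content ends up with the same object name.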
And finally, there was a tree.
So now that our file is tucked away in the object store, we have to wonder: What happened to its filename? After all, Git wouldn’t be that useful if it didn’t preserve folder structure and if it didn’t let us find our files by name.
Git tracks pathnames through a tree.
Go back to your command line and run git write-tree.
git write-tree (a low-level command) saves the state of the index (your staged files) to the object store. Thus, we now see a new object in the store: .git/objects/81/cbaf28bc31ce9218d51b685e35a08bfea99599. This new object is our tree.
Once you peek into the tree object, it’ll immediately make sense. You can see its contents with git cat-file -p followed by the tree’s hash (81cbaf28bc31ce9218d51b685e35a08bfea99599 here), which prints one line per entry.
You might already have guessed it, but the first field of the entry, 100644, is the file mode (the permissions, in octal). f68ce0 is the start of the blob’s hash, which tells Git where to find the blob in the store, and test.txt is the filename.
Directory hierarchies are represented in a similar manner: a subdirectory is simply an entry in its parent’s tree that points to another tree object instead of a blob.
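Here is a rough Python sketch of what a tree object looks like on the inside, including a nested directory. It assumes Git’s usual tree encoding: each entry is the mode, a space, the name, a NUL byte, and then the raw 20-byte SHA1, with entries sorted by name. The docs directory and notes.txt file are hypothetical additions for the example, and the helper names are mine, not Git’s.

```python
import hashlib


def object_id(kind: bytes, body: bytes) -> str:
    """Hash an object body together with its type-and-size header."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries):
    """Serialize (mode, name, hex_id) entries into a tree object's body."""
    # Git sorts entries by name, comparing directories as if they ended in "/".
    def sort_key(entry):
        mode, name, _ = entry
        raw = name.encode("utf-8")
        return raw + b"/" if mode == "40000" else raw

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        # Each entry: "<mode> <name>", a NUL byte, then the raw 20-byte SHA1.
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(hex_id)
    return body


def parse_tree(body: bytes):
    """Split a tree body back into (mode, name, hex_id) entries."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        hex_id = body[nul + 1:nul + 21].hex()
        entries.append((mode, name, hex_id))
        pos = nul + 21
    return entries


if __name__ == "__main__":
    blob = object_id(b"blob", b"git rocks\n")
    # A subdirectory is just another tree, referenced with mode 40000.
    docs = object_id(b"tree", tree_body([("100644", "notes.txt", blob)]))
    root = tree_body([("100644", "test.txt", blob), ("40000", "docs", docs)])
    for entry in parse_tree(root):
        print(entry)
    print(object_id(b"tree", root))
```

The stored mode for a directory is 40000; git cat-file -p pads it to 040000 when printing. Since a tree’s ID depends on the IDs of everything beneath it, changing one file changes the ID of every tree on the path up to the root.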
Conclusion
And there you have it, folks. That’s pretty much all there is to how Git stores objects. Pretty simple, huh?
I think it’s amazing how Linus built such a powerful and useful system by elegantly using a simple hashmap. I wish my software was more like Git.
I hope you too gain an appreciation for using simple constructs in your own code.