Some of git internals (updated)

(This is updated/revised blog post I first published in 27-Sep-2015.)

Simple test file tree

Initial commit

Here is my test file tree. There are two files and a "src" subdirectory, where another two files are stored:

install.txt
readme.txt
src         --> hello.c
                world.c

Here is their contents:

install.txt:

here are install instructions

readme.txt:

this is readme file

src/hello.c:

// this is source code for the "hello world" program

src/world.c:

// another piece of source code

Now I'm adding all these 4 files to a git repo and making initial commit. A lot of files in .git subdirectory are created. Let's focus first on "objects" directory. It has these files:

./objects/4a/cde9ab6dd9bf439ff2cbddb47d5e96b1f2e3ad
./objects/61/d7f2fcb4d4aa0c55abb07f0cca6fd6ffa91e00
./objects/2e/c39aec17a9e53d21dcdafd8cdbe3ae7ada8c57
./objects/8b/35c7d4622c1aa11531166e4bd7d1901c9d5d2b
./objects/ef/875aac086693ff89d2a21dbe2a78c34f053a73
./objects/25/457e6ce216a231dc45ad1f08449c72d2a3a674
./objects/d7/a7d9d04d26cfbfe4a492a737f4f81d993dbce6

As we may know, all these files are compressed using Zlib compressor. And there is "zpipe" util which can help to decompress Zlib-compressed files (present at least in opencaster package in Ubuntu, source code).

Let's start with the first file:

dennis@wombat:~/tmp/test/.git$ zpipe -d < objects/4a/cde9ab6dd9bf439ff2cbddb47d5e96b1f2e3ad
blob 54// this is the source code for the "hello world" program

Wow, that's familiar to us. But the file compressed is prepended by a "blob 54" string. "blob" mean this object has "blob" type, i.e., contain file. "54" is length of "blob". There is also a zero byte inside, which is visible using hexdump:

dennis@wombat:~/tmp/test/.git$ zpipe -d < objects/4a/cde9ab6dd9bf439ff2cbddb47d5e96b1f2e3ad | hexdump -C
00000000  62 6c 6f 62 20 35 34 00  2f 2f 20 74 68 69 73 20  |blob 54.// this |
00000010  69 73 20 73 6f 75 72 63  65 20 63 6f 64 65 20 66  |is source code f|
00000020  6f 72 20 74 68 65 20 22  68 65 6c 6c 6f 20 77 6f  |or the "hello wo|
00000030  72 6c 64 22 20 70 72 6f  67 72 61 6d 0a 0a        |rld" program..|
0000003e

We may have heard that the object ID in git is SHA1 of its contents. It is indeed so. But contents is prepended by a string consisting of object type and object size (what we just saw). We may check if this is true:

dennis@wombat:~/tmp/test/.git$ zpipe -d < objects/4a/cde9ab6dd9bf439ff2cbddb47d5e96b1f2e3ad | sha1sum
4acde9ab6dd9bf439ff2cbddb47d5e96b1f2e3ad  -

True, indeed. But the first byte of SHA1 sum is subdirectory, all the rest are file name for compressed object. There are other ways to calculate git SHA1 sum: http://stackoverflow.com/questions/552659/assigning-git-sha1s-without-git.

Let's proceed to other files:

dennis@wombat:~/tmp/test/.git$ zpipe -d < objects/61/d7f2fcb4d4aa0c55abb07f0cca6fd6ffa91e00
blob 33// another piece of source code

The next one:

dennis@wombat:~/tmp/test/.git$ zpipe -d < objects/2e/c39aec17a9e53d21dcdafd8cdbe3ae7ada8c57
...
(Ouch, there is some unprintable contents.)

Let's try hexdump:

dennis@wombat:~/tmp/test/.git$ zpipe -d < objects/2e/c39aec17a9e53d21dcdafd8cdbe3ae7ada8c57 | hexdump -C
00000000  74 72 65 65 20 37 30 00  31 30 30 36 34 34 20 68  |tree 70.100644 h|
00000010  65 6c 6c 6f 2e 63 00 4a  cd e9 ab 6d d9 bf 43 9f  |ello.c.J...m..C.|
00000020  f2 cb dd b4 7d 5e 96 b1  f2 e3 ad 31 30 30 36 34  |....}^.....10064|
00000030  34 20 77 6f 72 6c 64 2e  63 00 61 d7 f2 fc b4 d4  |4 world.c.a.....|
00000040  aa 0c 55 ab b0 7f 0c ca  6f d6 ff a9 1e 00        |..U.....o.....|
0000004e

Now this is the most interesting part: this object has a compressed "tree" object. In other words, this is just a list of files in directory. Using a naked eye, we can spot a "tree" object type at start, then "70" (object size), then access rights of "hello.c" file, then access rights for the "world.c" file. Both file names are clearly visible.

Using git we can print this obejct in prettier form:

dennis@wombat:~/tmp/test/.git$ git cat-file -p 2ec39aec17a9e53d21dcdafd8cdbe3ae7ada8c57
100644 blob 4acde9ab6dd9bf439ff2cbddb47d5e96b1f2e3ad    hello.c
100644 blob 61d7f2fcb4d4aa0c55abb07f0cca6fd6ffa91e00    world.c

These two hashes are IDs of objects where files are stored. You can find these IDs in hex dump, they are stored there in binary form to make things faster and more compact. Interestingly: the compressed object of "tree" type enlists descending objects in the tree, but doesn't specify their types, because their types can be easily deduced from the objects itself.

So this "tree" has a file list of "src" subdirectory. There are also no file/directory timestamps, because git doesn't preserve them.

Let's proceed to the rest of files:

dennis@wombat:~/tmp/test/.git$ zpipe -d < objects/8b/35c7d4622c1aa11531166e4bd7d1901c9d5d2b
blob 21this is readme file

Another "tree" object:

dennis@wombat:~/tmp/test/.git$ zpipe -d < objects/ef/875aac086693ff89d2a21dbe2a78c34f053a73 | hexdump -C
00000000  74 72 65 65 20 31 30 37  00 31 30 30 36 34 34 20  |tree 107.100644 |
00000010  69 6e 73 74 61 6c 6c 2e  74 78 74 00 d7 a7 d9 d0  |install.txt.....|
00000020  4d 26 cf bf e4 a4 92 a7  37 f4 f8 1d 99 3d bc e6  |M&......7....=..|
00000030  31 30 30 36 34 34 20 72  65 61 64 6d 65 2e 74 78  |100644 readme.tx|
00000040  74 00 8b 35 c7 d4 62 2c  1a a1 15 31 16 6e 4b d7  |t..5..b,...1.nK.|
00000050  d1 90 1c 9d 5d 2b 34 30  30 30 30 20 73 72 63 00  |....]+40000 src.|
00000060  2e c3 9a ec 17 a9 e5 3d  21 dc da fd 8c db e3 ae  |.......=!.......|
00000070  7a da 8c 57                                       |z..W|
00000074

It contains a file list of the root directory: two files "install.txt" and "readme.txt" and also "src" subdirectory. Let's dump it using git utility:

dennis@wombat:~/tmp/test/.git$ git cat-file -p ef875aac086693ff89d2a21dbe2a78c34f053a73
100644 blob d7a7d9d04d26cfbfe4a492a737f4f81d993dbce6    install.txt
100644 blob 8b35c7d4622c1aa11531166e4bd7d1901c9d5d2b    readme.txt
040000 tree 2ec39aec17a9e53d21dcdafd8cdbe3ae7ada8c57    src

IDs of files are IDs of objects where its contents is stored. ID of "src" tree is just ID of another "tree" object, which was inspected earlier.

The next object:

dennis@wombat:~/tmp/test/.git$ zpipe -d < objects/25/457e6ce216a231dc45ad1f08449c72d2a3a674
commit 189tree ef875aac086693ff89d2a21dbe2a78c34f053a73
author Dennis Yurichev  1442582288 +0300
committer Dennis Yurichev  1442582288 +0300

initial commit

This is object representing "commit". It has an ID of a tree plus some additional information, including date/time of commit. We can be sure that this is where current HEAD pointing:

dennis@wombat:~/tmp/test/.git$ cat HEAD
ref: refs/heads/master

dennis@wombat:~/tmp/test/.git$ cat refs/heads/master
25457e6ce216a231dc45ad1f08449c72d2a3a674
25457e6ce216a231dc45ad1f08449c72d2a3a674 is the ID of the object with "commit" information we just saw.

And the last object is just another blob:

dennis@wombat:~/tmp/test/.git$ zpipe -d < objects/d7/a7d9d04d26cfbfe4a492a737f4f81d993dbce6
blob 31here are install instructions

Adding "cloned" file

Now let's add another file, which has exact contents of what we already have:

dennis@wombat:~/tmp/test$ cp src/hello.c src/hello.c_copy

These objects we see after commit:

./objects/4a/cde9ab6dd9bf439ff2cbddb47d5e96b1f2e3ad
./objects/61/d7f2fcb4d4aa0c55abb07f0cca6fd6ffa91e00
./objects/2e/c39aec17a9e53d21dcdafd8cdbe3ae7ada8c57
./objects/8b/35c7d4622c1aa11531166e4bd7d1901c9d5d2b
./objects/2c/2a5998e0fbbb227605c9e48f8120d4a1326215 (new file)
./objects/ef/875aac086693ff89d2a21dbe2a78c34f053a73
./objects/0f/98834ba27232f2bd0d3fc8954ec805812cea3e (new file)
./objects/7b/911b9bc417505e7fbe329c1496ac55b9bf971d (new file)
./objects/25/457e6ce216a231dc45ad1f08449c72d2a3a674
./objects/d7/a7d9d04d26cfbfe4a492a737f4f81d993dbce6

There are 3 new objects. But no objects removed (because git doesn't delete anything). What is inside of new objects?

dennis@wombat:~/tmp/test/.git$ zpipe -d < objects/7b/911b9bc417505e7fbe329c1496ac55b9bf971d | hexdump -C
00000000  74 72 65 65 20 31 31 30  00 31 30 30 36 34 34 20  |tree 110.100644 |
00000010  68 65 6c 6c 6f 2e 63 00  4a cd e9 ab 6d d9 bf 43  |hello.c.J...m..C|
00000020  9f f2 cb dd b4 7d 5e 96  b1 f2 e3 ad 31 30 30 36  |.....}^.....1006|
00000030  34 34 20 68 65 6c 6c 6f  2e 63 5f 63 6f 70 79 00  |44 hello.c_copy.|
00000040  4a cd e9 ab 6d d9 bf 43  9f f2 cb dd b4 7d 5e 96  |J...m..C.....}^.|
00000050  b1 f2 e3 ad 31 30 30 36  34 34 20 77 6f 72 6c 64  |....100644 world|
00000060  2e 63 00 61 d7 f2 fc b4  d4 aa 0c 55 ab b0 7f 0c  |.c.a.......U....|
00000070  ca 6f d6 ff a9 1e 00                              |.o.....|
00000077

... or:

dennis@wombat:~/tmp/test/.git$ git cat-file -p 7b911b9bc417505e7fbe329c1496ac55b9bf971d
100644 blob 4acde9ab6dd9bf439ff2cbddb47d5e96b1f2e3ad    hello.c
100644 blob 4acde9ab6dd9bf439ff2cbddb47d5e96b1f2e3ad    hello.c_copy
100644 blob 61d7f2fcb4d4aa0c55abb07f0cca6fd6ffa91e00    world.c

This is the new "tree" object for "src" subdirectory, it has a copied file. But what is important: IDs are the same! Because they have the same contents, so git hadn't added another copy of file! git is content-addressable storage, about which you can read more here.

The "tree" is different, however, so this new object is created. Older "tree" version of this directory is still present and will always be.

Now the second object:

dennis@wombat:~/tmp/test/.git$ zpipe -d < objects/0f/98834ba27232f2bd0d3fc8954ec805812cea3e | hexdump -C
00000000  74 72 65 65 20 31 30 37  00 31 30 30 36 34 34 20  |tree 107.100644 |
00000010  69 6e 73 74 61 6c 6c 2e  74 78 74 00 d7 a7 d9 d0  |install.txt.....|
00000020  4d 26 cf bf e4 a4 92 a7  37 f4 f8 1d 99 3d bc e6  |M&......7....=..|
00000030  31 30 30 36 34 34 20 72  65 61 64 6d 65 2e 74 78  |100644 readme.tx|
00000040  74 00 8b 35 c7 d4 62 2c  1a a1 15 31 16 6e 4b d7  |t..5..b,...1.nK.|
00000050  d1 90 1c 9d 5d 2b 34 30  30 30 30 20 73 72 63 00  |....]+40000 src.|
00000060  7b 91 1b 9b c4 17 50 5e  7f be 32 9c 14 96 ac 55  |{.....P^..2....U|
00000070  b9 bf 97 1d                                       |....|
00000074

... or:

dennis@wombat:~/tmp/test/.git$ git cat-file -p 0f98834ba27232f2bd0d3fc8954ec805812cea3e
100644 blob d7a7d9d04d26cfbfe4a492a737f4f81d993dbce6    install.txt
100644 blob 8b35c7d4622c1aa11531166e4bd7d1901c9d5d2b    readme.txt
040000 tree 7b911b9bc417505e7fbe329c1496ac55b9bf971d    src

It has a new "tree" object for root directory. It has changed because the ID of "src" has changed its ID (because it has different content and SHA1 hash).

The last object:

dennis@wombat:~/tmp/test/.git$ zpipe -d < objects/2c/2a5998e0fbbb227605c9e48f8120d4a1326215
commit 236tree 0f98834ba27232f2bd0d3fc8954ec805812cea3e
parent 25457e6ce216a231dc45ad1f08449c72d2a3a674
author Dennis Yurichev  1442585229 +0300
committer Dennis Yurichev  1442585229 +0300

second commit

It is the second "commit" object. In contrast of the initial commit object we already saw, this one has "parent" link, which is an ID of a previous commit object. Oh, and the current HEAD pointing to this new (second) commit:

dennis@wombat:~/tmp/test/.git$ cat refs/heads/master
2c2a5998e0fbbb227605c9e48f8120d4a1326215

If the "cat-file" output is too raw for you, there is always "git show" utility, which shows object contents in prettier form:

dennis@wombat:~/tmp/test/.git$ git show 2c2a5998e0fbbb227605c9e48f8120d4a1326215
commit 2c2a5998e0fbbb227605c9e48f8120d4a1326215
Author: Dennis Yurichev 
Date:   Fri Sep 18 17:07:09 2015 +0300

    second commit

diff --git a/src/hello.c_copy b/src/hello.c_copy
new file mode 100644
index 0000000..4acde9a
--- /dev/null
+++ b/src/hello.c_copy
@@ -0,0 +1,2 @@
+// this is source code for the "hello world" program
+

By the way, old object files were not modified after the second commit! Of course not, they still has same IDs.

Deleting a file

Let's delete some file:

dennis@wombat:~/tmp/test$ rm install.txt

This is object list after third commit:

./objects/4a/cde9ab6dd9bf439ff2cbddb47d5e96b1f2e3ad
./objects/61/d7f2fcb4d4aa0c55abb07f0cca6fd6ffa91e00
./objects/2e/c39aec17a9e53d21dcdafd8cdbe3ae7ada8c57
./objects/8b/35c7d4622c1aa11531166e4bd7d1901c9d5d2b
./objects/ea/7af6190471c3571899ae68281fbd9b3bf82c71 (new file)
./objects/2c/2a5998e0fbbb227605c9e48f8120d4a1326215
./objects/0c/077dd09d6ff4a8c90bf14226ce5060db57ad94 (new file)
./objects/ef/875aac086693ff89d2a21dbe2a78c34f053a73
./objects/0f/98834ba27232f2bd0d3fc8954ec805812cea3e
./objects/7b/911b9bc417505e7fbe329c1496ac55b9bf971d
./objects/25/457e6ce216a231dc45ad1f08449c72d2a3a674
./objects/d7/a7d9d04d26cfbfe4a492a737f4f81d993dbce6

What is in new files?

First:

dennis@wombat:~/tmp/test/.git$ git cat-file -p 0c077dd09d6ff4a8c90bf14226ce5060db57ad94
100644 blob 8b35c7d4622c1aa11531166e4bd7d1901c9d5d2b    readme.txt
040000 tree 7b911b9bc417505e7fbe329c1496ac55b9bf971d    src

Second:

dennis@wombat:~/tmp/test/.git$ zpipe -d < objects/ea/7af6190471c3571899ae68281fbd9b3bf82c71
commit 256tree 0c077dd09d6ff4a8c90bf14226ce5060db57ad94
parent 2c2a5998e0fbbb227605c9e48f8120d4a1326215
author Dennis Yurichev  1442587436 +0300
committer Dennis Yurichev  1442587436 +0300

There are just two objects created: new tree for root directory and new commit. By the way, other objects hasn't been modified in this case as well.

Recreating repository

Would it be possible for me to create a such demo repository with the same IDs? Yes: if you'll recreate all text files byte-by-byte, so they will have the same SHA1 IDs, both blobs and trees. But "commit" objects will be different, because, likely, you'll create it under your name/email and in another date/time. But if you'll go that far, you can recreate commit objects just like mine, of course.

Nonetheless, for those who interested, the repositories I experimented with are here.

Trees are always sorted

... otherwise, two identical trees could result in different SHA1 IDs. This is hash tree (or Merkle tree): wikipedia.

Linux Kernel tree

I've cloned latest Linux kernel and git checked out all the files. But let's see, would it be possible to walk on file tree of some specific Linux release without checking out all the files (there are a lot of them, so it's slow!) When I write this article, current Linux kernel version is v4.3-rc1, but let's take a look on v3.0:

dennis@bigbox:~/src/linux-kernel/linux-git$ git show v3.0
tag v3.0
Tagger: Linus Torvalds 
Date:   Thu Jul 21 19:17:29 2011 -0700

Linux 3.0

w00t!
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.16 (GNU/Linux)

iEYEABECAAYFAk4o3cgACgkQF3YsRnbiHLvw3gCfYIx8Sw/s3s4LNh8ij7Rb3ltt
9RsAoJgdZdHrRrWeB3G1q92FcJtbMHu9
=m0tn
-----END PGP SIGNATURE-----

commit 02f8c6aee8df3cdc935e9bdd4f2d020306035dbe
Author: Linus Torvalds 
Date:   Thu Jul 21 19:17:23 2011 -0700

    Linux 3.0
    ...

This is a "tag" object (4th one, there are only four object types in git: blob, tree, commit, tag). Here we can see commit ID: 02f8c6aee8df3cdc935e9bdd4f2d020306035dbe. Let's see:

dennis@bigbox:~/src/linux-kernel/linux-git$ git cat-file -p 02f8c6aee8df3cdc935e9bdd4f2d020306035dbe
tree df32c75242bf8d797ccd43af8ce8e294f35cd8fd
parent 1f922d07704c501388a306c78536bca7432b3934
author Linus Torvalds  1311301043 -0700
committer Linus Torvalds  1311301043 -0700
Linux 3.0

Now this is a "commit" object, it has a link to a tree ID df32c75242bf8d797ccd43af8ce8e294f35cd8fd. Let's take a look into the tree:

dennis@bigbox:~/src/linux-kernel/linux-git$ git cat-file -p df32c75242bf8d797ccd43af8ce8e294f35cd8fd
100644 blob 9dacde0a4b2dcec4ce33013354b6c46738daaef7    .gitignore
100644 blob 353ad5607156f0f2d4b5138f1b5d9db529933aa8    .mailmap
100644 blob ca442d313d86dc67e0a2e5d584b465bd382cbf5c    COPYING
100644 blob 1deb331d96edbe60f3026ce4175dc0efe3bffaa4    CREDITS
040000 tree dfcc154f765364b00ea39827f9e337314b498291    Documentation
100644 blob 2114113ceca2801770c57ac07c78fff2b0b8a477    Kbuild
100644 blob c13f48d65898487105f0193667648382c90d0eda    Kconfig
100644 blob 187282da92137a2ea147bcb0f664e0e99d90e681    MAINTAINERS
100644 blob 6a5bdad524affe34c62141ae9678da115fcc28ee    Makefile
100644 blob 0d5a7ddbe3ee8d108bf7079909ddcec30dfb4560    README
100644 blob 55a6074ccbb715d99b642fa510d3c993121f453d    REPORTING-BUGS
040000 tree 3c904b66b173d88f424b957c9d63d9df9a7dfceb    arch
040000 tree 1a854f3d15191bd1313eee6561bc2f5192e70281    block
040000 tree 99f9616822f57a4a138e96b3e125be382adca109    crypto
040000 tree 912f6edfe690c212e2ea8a2d912372c9657689ba    drivers
040000 tree 0bfbce5a445d533c1e84a53db5e7665d49e8141d    firmware
040000 tree e9a60176c0b7aa680f16a744a813075f3d7fcdf5    fs
040000 tree 7caa537dae98db80e1d7e4b8de800e3ca98a9330    include
040000 tree 5746e99029889fab537a87c5b410619d328a70da    init
040000 tree 0bcfb00e995f70e15d50fb488826215f35d7e6a7    ipc
040000 tree 29cafccb28f5a39ec85392099f3d980afcbec5e7    kernel
040000 tree ea9f4393812615ca8667f4b6be6de0fa6a20e3c6    lib
040000 tree 90fd1ee8343340a2e5e9089312c32f5dbde313c4    mm
040000 tree be49bafeb7aeeb9dcce2da58717cb1d60101d9ae    net
040000 tree c8074a39cb7977631a23de3d3df1563f7e99fddb    samples
040000 tree f66d97ab104ae00a8f1e3f337e0d3b112a3d4ecc    scripts
040000 tree db1c9d45c75a87a0eab10bd081272ad2c7410340    security
040000 tree 76c0db062b1dfdae7960e45511b38519f33edbb7    sound
040000 tree 9699f25d5bcb2784cc710785777748836a8cfcfb    tools
040000 tree 1e3684ddfd7c6b964e6ca2ad6498e3f8b24a7762    usr
040000 tree ab7e9c5a83bd4ea0b265a745933cec9e89e30946    virt

This is how Linux kernel root tree looked like in version v3.0. Let's peek into the "arch" subdirectory:

dennis@bigbox:~/src/linux-kernel/linux-git$ git cat-file -p 3c904b66b173d88f424b957c9d63d9df9a7dfceb
100644 blob 7414689203208919d7de8758e750cf4e2d8f0961    .gitignore
100644 blob 26b0e2397a5724140bc8683fedd6bbae67706fe5    Kconfig
040000 tree 2b990a6397576c8e00940f1c817adf8e1a5df273    alpha
040000 tree fc015283e022339d799c9eedd23d4f62e92e0aec    arm
040000 tree 794853b6df19fc8a8eeb54926962b496f0e6c941    avr32
040000 tree 4697f8f4c2668cd4c088330a1010cc4c47af7dcf    blackfin
040000 tree 336265a1bb4ce023a7b1e0fcd36cfe9a38ff35f2    cris
040000 tree f94cb4276f51f312ce7338c7108ebf2038bc508a    frv
040000 tree b4ca635f1f809680e3afb681a0c8ca9969cc56a7    h8300
040000 tree 9bf423815cfb3f3f50f065d3d202f22afd8c8952    ia64
040000 tree cd73a96bb8b271cd4dc7ccc25f507886724418a4    m32r
040000 tree c4e86bbe47f34dd22d2f1c70e0e5a75d920274a1    m68k
040000 tree d58fa182a92f15ced603cc103f80f9a40b07f013    microblaze
040000 tree 949a999878382076e88e933771a682e0b0eeae04    mips
040000 tree 4f29482ed729593246f9c32d2fc09bb3ddc15d0a    mn10300
040000 tree de143a7b81bd6eaedbd5bc2ce7f995d3b415ec27    parisc
040000 tree f6ea0236658d350aded70545e2f0f46ea9e47783    powerpc
040000 tree c4fd94dd807ef9935576b1f6d90e0e6b727f8f57    s390
040000 tree 5ec2a83d90b6856f39f7fef6493de9c9864d045f    score
040000 tree 27b918c6fe277006b8b8f7515ef1bdcbad018cac    sh
040000 tree 06697af330f49913a3d816e4a97255cc55542bcb    sparc
040000 tree bee5e83168bc3d578e144b43357771f5da984230    tile
040000 tree d68a235cb0b4fb29a64c3e274844de884d775355    um
040000 tree b828e78dbcd3530484b285d4bd3e7570e9659b19    unicore32
040000 tree 9b877cfb8890bad2ae9f0f310ddf6e44596f3315    x86
040000 tree f9941e6c45c878838ba4390dc26b8d06b13369c4    xtensa

What the arch/Kconfig file had inside in Linux v3.0?

dennis@bigbox:~/src/linux-kernel/linux-git$ git cat-file -p 26b0e2397a5724140bc8683fedd6bbae67706fe5

#
# General architecture dependent options
#

config OPROFILE
        tristate "OProfile system profiling"
        depends on PROFILING
        depends on HAVE_OPROFILE
        select RING_BUFFER
        select RING_BUFFER_ALLOW_SWAP
        help
          OProfile is a profiling system capable of profiling the
          whole system, include the kernel, kernel modules, libraries,
          and applications.

          If unsure, say N.
...

All IDs are real, so anyone can get Linux kernel git repository and repeat my steps. Isn't it cool?

SHA1

SHA1 was cryptographically secure, so as a developer, you could identify your tree using just one single SHA1 ID. Alas, SHA1 has been broken in 2015-2019, but git is still usable with old SHA1 --- like CRC32 is usable despite the fact it can be forged easily.

Maybe Linus Torvalds and/or git developers would add another hash algorithm (yet unbroken) --- but things will remain basically the same.

.git/HEAD

This text file usually has a name of current branch:

dennis@wombat:~/tmp/test$ cat .git/HEAD
ref: refs/heads/master

... which is, in turn, has an ID of the last commit:

dennis@wombat:~/tmp/test$ cat .git/refs/heads/master
d2315fbf4d1497c8876cc4801c250fc59e987487

git log

When you run "git log", it shows the information about commit to which .git/HEAD is currently pointing. Then it takes the commit ID from a "parent" link and does the same. Then again, recursively. It stops when it finds a commit with no "parent" link, i.e., the first one.

Two types of git tags

Lightweight tag is just a pointer to some commit. Here I'm creating a pointer to the very first commit in my demo repository:

dennis@wombat:~/tmp/test$ git log
...
commit 25457e6ce216a231dc45ad1f08449c72d2a3a674
Author: Dennis Yurichev 
Date:   Fri Sep 18 16:18:08 2015 +0300

    initial commit

dennis@wombat:~/tmp/test$ git tag v1 25457e6ce216a231dc45ad1f08449c72d2a3a674

dennis@wombat:~/tmp/test$ git tag
v1

dennis@wombat:~/tmp/test$ cat .git/refs/tags/v1
25457e6ce216a231dc45ad1f08449c72d2a3a674

Lightweight tags are stored in text files under .git/refs/tags

Annotated tag is a tag stored in an object with "tag" type. It has an additional (optionally PGP-signed) text and a link to commit ID. Here I use "v3.0" in command line as an alias of an object ID with "tag" type:

dennis@bigbox:~/src/linux-kernel/linux-git$ git cat-file -t v3.0
tag

dennis@bigbox:~/src/linux-kernel/linux-git$ git cat-file -p v3.0
object 02f8c6aee8df3cdc935e9bdd4f2d020306035dbe
type commit
tag v3.0
tagger Linus Torvalds  1311301049 -0700

Linux 3.0

w00t!
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.16 (GNU/Linux)

iEYEABECAAYFAk4o3cgACgkQF3YsRnbiHLvw3gCfYIx8Sw/s3s4LNh8ij7Rb3ltt
9RsAoJgdZdHrRrWeB3G1q92FcJtbMHu9
=m0tn
-----END PGP SIGNATURE-----

dennis@bigbox:~/src/linux-kernel/linux-git$ git cat-file -p 02f8c6aee8df3cdc935e9bdd4f2d020306035dbe
tree df32c75242bf8d797ccd43af8ce8e294f35cd8fd
parent 1f922d07704c501388a306c78536bca7432b3934
author Linus Torvalds  1311301043 -0700
committer Linus Torvalds  1311301043 -0700

Linux 3.0

Branching

Branching is just diverging from some commit to two or more different commits. You create a new branch, but "master" branch still points to some commit. You do some work, and then you can see a difference between your current branch and "master". You can checkout files from "master" branch again, or from any other commit.

Merging

Let's create a new branch:

dennis@wombat:~/tmp/test$ git branch new_branch

dennis@wombat:~/tmp/test$ git checkout new_branch
Switched to branch 'new_branch'

dennis@wombat:~/tmp/test$ vi readme.txt

dennis@wombat:~/tmp/test$ vi src/hello.c

dennis@wombat:~/tmp/test$ git diff
diff --git a/readme.txt b/readme.txt
index 8b35c7d..3b7f9f2 100644
--- a/readme.txt
+++ b/readme.txt
@@ -1,2 +1,4 @@
 this is readme file
+I'll add some very cool feature A in this branch!
+
diff --git a/src/hello.c b/src/hello.c
index 4acde9a..3176765 100644
--- a/src/hello.c
+++ b/src/hello.c
@@ -1,2 +1,4 @@
 // this is source code for the "hello world" program

+// new feature A!
+

dennis@wombat:~/tmp/test$ git commit -a -m"feature A"
[new_branch 1782b8d] feature A
 2 files changed, 4 insertions(+)

Now we switching to master and creating another branch:

dennis@wombat:~/tmp/test$ git checkout master
Switched to branch 'master'

dennis@wombat:~/tmp/test$ git branch another_branch

dennis@wombat:~/tmp/test$ git checkout another_branch
Switched to branch 'another_branch'

dennis@wombat:~/tmp/test$ vi readme.txt

dennis@wombat:~/tmp/test$ vi src/world.c

dennis@wombat:~/tmp/test$ git diff
diff --git a/readme.txt b/readme.txt
index 8b35c7d..a55e832 100644
--- a/readme.txt
+++ b/readme.txt
@@ -1,2 +1,4 @@
 this is readme file

+feature B added!
+
diff --git a/src/world.c b/src/world.c
index 61d7f2f..661d47f 100644
--- a/src/world.c
+++ b/src/world.c
@@ -1,2 +1,3 @@
 // another piece of source code

+// feature B is here

dennis@wombat:~/tmp/test$ git commit -a -m"feature B"
[another_branch 085eb71] feature B
 2 files changed, 3 insertions(+)

These heads has 3 different IDs, each of which pointing to 3 different trees. master - original tree; new_branch is where "feature A" has been added; another_branch is where "feature B" has been added.

dennis@wombat:~/tmp/test$ cat .git/refs/heads/another_branch
085eb716475f3f2c83ccd7f0ef0b363697049f2e

dennis@wombat:~/tmp/test$ cat .git/refs/heads/master
ea7af6190471c3571899ae68281fbd9b3bf82c71

dennis@wombat:~/tmp/test$ cat .git/refs/heads/new_branch
1782b8ddab75fb4fdc7fd10a6d122c3321057eca

Let's merge another_branch and new_branch (current branch is still another_branch):

dennis@wombat:~/tmp/test$ git merge new_branch
Auto-merging readme.txt
CONFLICT (content): Merge conflict in readme.txt
Automatic merge failed; fix conflicts and then commit the result.

dennis@wombat:~/tmp/test$ vi readme.txt

dennis@wombat:~/tmp/test$ git commit -a -m"conflict resolved"
[another_branch d2315fb] conflict resolved

Merge commit is the commit with two parent links (which are two commit IDs):

dennis@wombat:~/tmp/test$ git log
commit d2315fbf4d1497c8876cc4801c250fc59e987487
Merge: 085eb71 1782b8d
Author: Dennis Yurichev 
Date:   Sun Sep 20 14:11:11 2015 +0300
    conflict resolved

dennis@wombat:~/tmp/test$ git cat-file -p d2315fbf4d1497c8876cc4801c250fc59e987487
tree e46fc118b45d21b16d8a9739ceea24e998acd628
parent 085eb716475f3f2c83ccd7f0ef0b363697049f2e
parent 1782b8ddab75fb4fdc7fd10a6d122c3321057eca
author Dennis Yurichev  1442747471 +0300
committer Dennis Yurichev  1442747471 +0300

conflict resolved

"tree" points to the newly consructed (during merge) file tree.

"another_branch" text file is now pointing to the last commit we just made:

dennis@wombat:~/tmp/test$ cat .git/refs/heads/another_branch
d2315fbf4d1497c8876cc4801c250fc59e987487

(All other heads wasn't changed during merge).

Make the current branch as master:

dennis@wombat:~/tmp/test$ git checkout master
Switched to branch 'master'

dennis@wombat:~/tmp/test$ git merge another_branch
Updating ea7af61..d2315fb
Fast-forward
 readme.txt  | 4 ++++
 src/hello.c | 2 ++
 src/world.c | 1 +
 3 files changed, 7 insertions(+)

Now master branch has the same ID as "another_branch", they are indistinguishable:

dennis@wombat:~/tmp/test$ cat .git/refs/heads/master
d2315fbf4d1497c8876cc4801c250fc59e987487

dennis@wombat:~/tmp/test$ cat .git/refs/heads/another_branch
d2315fbf4d1497c8876cc4801c250fc59e987487

dennis@wombat:~/tmp/test$ cat .git/refs/heads/new_branch
1782b8ddab75fb4fdc7fd10a6d122c3321057eca

Now we can delete other branches safely, which is merely deleting pointer files (all objects files will be left untouched):

dennis@wombat:~/tmp/test$ git branch -d another_branch
Deleted branch another_branch (was d2315fb).

dennis@wombat:~/tmp/test$ git branch -d new_branch
Deleted branch new_branch (was 1782b8d).

dennis@wombat:~/tmp/test$ ls .git/refs/heads/
master

git stash

The git stash command commites your changes into a separate commit ID and restores file tree from your last commit (or current HEAD, which hasn't been touched yet). All these commit IDs are saved in the .git/refs/stash text file:

dennis@wombat:~/tmp/test$ vi readme.txt

dennis@wombat:~/tmp/test$ git diff
diff --git a/readme.txt b/readme.txt
index 8b35c7d..1907f4f 100644
--- a/readme.txt
+++ b/readme.txt
@@ -1,2 +1,4 @@
 this is readme file

+let me write more here... ouch, I've been interrupted.
+

dennis@wombat:~/tmp/test$ git stash
Saved working directory and index state WIP on master: ea7af61 third commit: install.txt deleted
HEAD is now at ea7af61 third commit: install.txt deleted

dennis@wombat:~/tmp/test$ cat .git/refs/stash
a7ef0db2f3cd7b16653360c78b6051b0e6252b22

dennis@wombat:~/tmp/test$ git stash list
stash@{0}: WIP on master: ea7af61 third commit: install.txt deleted

dennis@wombat:~/tmp/test$ git show a7ef0db2f3cd7b16653360c78b6051b0e6252b22
commit a7ef0db2f3cd7b16653360c78b6051b0e6252b22
Merge: ea7af61 13f7229
Author: Dennis Yurichev 
Date:   Sun Sep 20 13:47:38 2015 +0300

    WIP on master: ea7af61 third commit: install.txt deleted

diff --cc readme.txt
index 8b35c7d,8b35c7d..1907f4f
--- a/readme.txt
+++ b/readme.txt
@@@ -1,2 -1,2 +1,4 @@@
  this is readme file
++let me write more here... ouch, I've been interrupted.
++

Pull requests

Pull request is an offer to merge changes from other's tree. Let's say, @regehr at GitHub wants to fix a bug in Z3 repository. He makes a fork at GitHub, the operation that clones (or copying) repository to his own account. He makes changes and commits it on his local computer. His repository is now different from what is at Z3's GitHub account at the moment: it has several additional commits. He then pushes his repository to his personal space at GitHub, in other words, to his fork of Z3's repository. Now he making "pull request", an offer to Z3's maintainers to grab the changes: https://github.com/Z3Prover/z3/pull/4833. If everything can go smoothly, GitHub can merge automatically.

Now my own GitHub repository at https://github.com/Z3Prover/z3 has commits from @regehr.

Now let's imagine, Z3's maintainer, on his local computer, worked on repository and made some changes and commited them. Let's say, he forgot about @regehr's contribution and trying to push his changes to his GitHub account. git doesn't allow to do so ("remote rejected"), because repositories (his local and current at GitHub) are different. So he first have to pull changes from his GitHub account to his local computer (merging). Hi will merge his work with @regehr's work. Only after that he can push the most updated version of repository to his GitHub repository.

In case of conflicts (two developers edited a same line in a same file), GitHub offers to merge on your local computer, because it requires conflict resolution in your text editor (constructing a line from two versions of line): https://help.github.com/articles/checking-out-pull-requests-locally/.

In this case, it's possible to clone @regehr's repository from https://github.com/regehr/z3 to a local repository, and do the work locally. This is what people do who do not use GitHub.

Pull request may contain several commits.

Pull request is not mandatory: maintainers can merge it with their tree, may not, or may do this in the future at the right moment.

Detached head

The .git/HEAD file usually contains a current branch name. For very basic git usage, current head is always pointing to a "master" branch.

Now let's say I want to compile Linux kernel v3.0. How to check out files for Linux v3.0?

dennis@bigbox:~/src/linux-kernel/linux-git$ git checkout v3.0
Checking out files: 100% (56057/56057), done.
Note: checking out 'v3.0'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at 02f8c6a... Linux 3.0

Let's see current contents of the HEAD file:

dennis@bigbox:~/src/linux-kernel/linux-git$ cat .git/HEAD
02f8c6aee8df3cdc935e9bdd4f2d020306035dbe

That means, the current head is now points to somewhere in the middle of the whole Linux commit history. It is indeed points to the commit where the author added "v3.0" tag:

dennis@bigbox:~/src/linux-kernel/linux-git$ git show 02f8c6aee8df3cdc935e9bdd4f2d020306035dbe
commit 02f8c6aee8df3cdc935e9bdd4f2d020306035dbe
Author: Linus Torvalds 
Date:   Thu Jul 21 19:17:23 2011 -0700

    Linux 3.0

...

Now I can compile Linux v3.0. How to get back? Just checkout files of the "master" branch:

dennis@bigbox:~/src/linux-kernel/linux-git$ git checkout master
Checking out files: 100% (56057/56057), done.
Previous HEAD position was 02f8c6a... Linux 3.0
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.

Current head is now points to "master" branch:

dennis@bigbox:~/src/linux-kernel/linux-git$ cat .git/HEAD
ref: refs/heads/master

What if I want to change something in v3.0 and commit my changes? Well, it's not possible to overwrite commits happens after "v3.0" commit. But you can create new branch stemming from "v3.0" commit (as git proposes) and work there.

git fsck

So, speaking theoretically, git objects form a graph. But sometimes, some objects may be dropped due to some operations:

Usually, dangling blobs and trees aren't very interesting. They're almost always the result of either being a half-way mergebase (the blob will often even have the conflict markers from a merge in it, if you have had conflicting merges that you fixed up by hand), or simply because you interrupted a 'git fetch' with ^C or something like that, leaving _some_ of the new objects in the object database, but just dangling and useless.

( https://github.com/git/git/blob/master/Documentation/user-manual.txt )

In this case, object will be present, but its ID will never be used. It's like a file on filesystem which is deleted (no file system data structures has its address or ID), but its contents is still on your disk. It's also possible in git, and "git fsck" does cleanup: the utility forms a graph and drops objects which are not member of the graph.

Git packfiles

Keeping each object in each file under a .git subdirectory is good for 1) demonstration, like this one; 2) faster access to objects. But more economical way of keeping records is to store objects in compressed pack files. The objects there are addressed in the same way, but stored in the single file (like WAD file in DOOM game or PAK file in Quake game). The git gc command forces git to pack your objects to packfiles.

git as versioning file system.

This is why git is often called versioning file system: besides the fact git stores each version of each file, it also stores the whole file tree at each moment of time.

.git/FETCH_HEAD

You may heard that git pull is in fact two commands: git fetch and git merge. Yes, git fetch get objects from remove repo to yours, but doesn't alter your repo. It updates the .git/FETCH_HEAD file with ID.

Now you can diff your master with the last commit of the repo you just fetched: git diff FETCH_HEAD HEAD. If everything is OK, you can merge: git merge FETCH_HEAD.

I used this when pulling commits from contributors to my projects without using GitHub.

Text files under .git

There are many text files that worth studying. Many of them are editable. I found editing info about remote repos (its names, URLs) with a text editor is simpler than using git remote command. Also, unused entries can be (un)commented using ';'.


A discussion at reddit: https://www.reddit.com/r/git/comments/3ml4qj/some_of_git_internals/.

UPD: another discussion at HN (December 2020).

UPD: the next part.


List of my other blog posts.

Yes, I know about these lousy Disqus ads. Please use adblocker. I would consider to subscribe to 'pro' version of Disqus if the signal/noise ratio in comments would be good enough.