Posted Friday, 27 August 2010
The central idea is simple: Git is not a versioning system at all, at least not in the way CVS, SVN, Perforce, or Mercurial are. This is the reason other versioning systems struggle to match Git in terms of features, speed, and compactness of storage.
So what is Git? It's a key-value object store (in
.git/objects) plus the infrastructure to play with it.
The keys are SHA1 hashes and values can be four kinds of objects:
blobs, trees, commits, and tags.
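To make the key-value idea concrete, here is a minimal Python sketch of how the key for an object is derived. It assumes the default SHA1 object format, in which the hash covers a short header (the object type, the content length, and a NUL byte) followed by the content. The function name git_object_id is made up for illustration.

```python
import hashlib


def git_object_id(kind: str, content: bytes) -> str:
    """Return the SHA1 key Git would assign to an object of this kind."""
    # The header is "<type> <size>\0"; the key is the hash of header + content.
    header = f"{kind} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Key of an empty blob, the same value `git hash-object` reports
    # for an empty file.
    print(git_object_id("blob", b""))
```

The key depends only on the object's type and content, never on a file name or a timestamp, which is what lets the store behave like a plain hash table.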
Git is often described as a "dumb content tracker". It is dumb because traditional versioning systems use a delta storage system; they use hashing algorithms to find out the differences between the way your project currently looks and the way it looked in the previous commit, figure out how to represent the differences, and finally store them as compressed deltas. Git, on the other hand, simply stores a complete snapshot of the repository every time a commit is created*; consequently, committing is a very cheap operation in Git. It is a content tracker because it separates the content of the files from the properties of the files and the directory structure.
.
|-- README
`-- lib
    |-- inc
    |   `-- tricks.rb
    `-- mylib.rb

2 directories, 3 files

Note: The same object is never stored twice. So, if you have a
file that's untouched between five commits, all five commit
objects will reference the same blob object. More correctly, a
blob isn't a delta: it's the full text.
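The sharing described in the note can be sketched with a toy in-memory store. This is an illustration under stated assumptions, not Git's implementation: the dictionary store stands in for .git/objects, the helpers put and write_tree are hypothetical names, file modes are hard-coded, and tree entries are sorted by plain name, which is simpler than Git's real ordering rule.

```python
import hashlib

store = {}  # key (hex SHA1) -> raw object bytes; a stand-in for .git/objects


def put(kind: str, body: bytes) -> str:
    """Store an object under the hash of its header + body and return the key."""
    raw = f"{kind} {len(body)}".encode("ascii") + b"\x00" + body
    key = hashlib.sha1(raw).hexdigest()
    store[key] = raw  # storing identical content again changes nothing
    return key


def write_tree(snapshot: dict) -> str:
    """Recursively store a nested {name: bytes | dict} snapshot as blobs and trees."""
    entries = b""
    for name in sorted(snapshot):  # simplified ordering (assumption)
        value = snapshot[name]
        if isinstance(value, dict):
            mode, key = "40000", write_tree(value)
        else:
            mode, key = "100644", put("blob", value)
        entries += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(key)
    return put("tree", entries)


first = {
    "README": b"read me\n",
    "lib": {"mylib.rb": b"puts 1\n", "inc": {"tricks.rb": b"puts 2\n"}},
}
root_1 = write_tree(first)
count_1 = len(store)  # 3 blobs + 3 trees

second = dict(first, README=b"read me, please\n")  # only README changes
root_2 = write_tree(second)
count_2 = len(store)  # one new blob + one new root tree

print(count_1, count_2)  # prints "6 8"
```

The second snapshot is complete, yet it costs only two new objects: the changed blob and the root tree that points at it. The lib tree and everything under it are reused by key.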
The above diagram is what one revision of the repository looks
like in the object store. Now imagine many such commit objects
connected in a DAG; consequently, once the SHA1 hashes of the tips
of the DAG (called heads) are known, every commit is
reachable. The end-user can't be expected to memorize cryptic SHA1
hashes: that's why refs in .git/refs exist.
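Why knowing the heads is enough can be shown with a small graph walk. The sketch below uses a made-up history (short fake ids instead of SHA1 hashes, and a hypothetical parents table) and simply follows parent links from each ref.

```python
# Hypothetical history: commit id -> list of parent ids.
parents = {
    "a1": [],
    "b2": ["a1"],
    "c3": ["b2"],
    "d4": ["b2"],
    "e5": ["c3", "d4"],  # a merge commit has two parents
    "f6": ["a1"],
}

# Refs are just human-readable names for the tips of the DAG.
refs = {"refs/heads/master": "e5", "refs/heads/topic": "f6"}


def reachable(heads, parents):
    """Return every commit that can be reached by following parent links."""
    seen = set()
    stack = list(heads)
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents[commit])
    return seen


if __name__ == "__main__":
    print(sorted(reachable(refs.values(), parents)))
```

A commit that no ref can reach this way is effectively invisible, which is why refs matter so much in this model.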
Using this model has several ramifications. Let's say there are
100 commits in the repository. I add a commit and slap the ref
moo onto it, while you create one and slap the ref
foo onto it. Voila! moo and
foo are branches. There's one last piece of the puzzle
missing: when a command such as git commit is issued,
how does Git know which commit in the DAG to link up the new commit
object to? In other words, what is the "current" commit or ref I'm
on? There's a special symbolic ref .git/HEAD to take
care of this; switching to branch moo is as simple as
running echo ref: refs/heads/moo > .git/HEAD.
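The lookup that HEAD enables can be sketched in a few lines of Python. This is a simplified illustration, not how Git itself does it: it only handles loose ref files (it ignores packed refs), it touches neither the index nor the working tree as a real checkout would, and the helper names resolve_head and switch_branch and the fake 40-character ids are made up for the example.

```python
import os
import tempfile


def resolve_head(git_dir: str) -> str:
    """Return the commit id HEAD points at, following one symbolic ref."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if not head.startswith("ref: "):
        return head  # detached HEAD: the file holds a commit id directly
    ref_name = head[len("ref: "):]
    with open(os.path.join(git_dir, *ref_name.split("/"))) as f:
        return f.read().strip()


def switch_branch(git_dir: str, branch: str) -> None:
    """Repoint HEAD at a branch, like the echo command in the text."""
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write(f"ref: refs/heads/{branch}\n")


# Build a throwaway directory that mimics the relevant part of .git
with tempfile.TemporaryDirectory() as git_dir:
    heads = os.path.join(git_dir, "refs", "heads")
    os.makedirs(heads)
    for branch, commit_id in (("moo", "1" * 40), ("foo", "2" * 40)):
        with open(os.path.join(heads, branch), "w") as f:
            f.write(commit_id + "\n")

    switch_branch(git_dir, "moo")
    on_moo = resolve_head(git_dir)
    switch_branch(git_dir, "foo")
    on_foo = resolve_head(git_dir)
```

When a new commit is created, its parent is whatever resolve_head returns, and the ref that HEAD names is then moved to the new commit.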
The most obvious problem that comes to mind is size. If Git
doesn't deltify objects, how does it afford network operations?
Enter the Git packfile
format. Using git pack-objects, git
unpack-objects and git repack, it deltifies and
packs loose objects (in .git/objects) into packfiles
along with an index for locating objects inside each pack (in
.git/objects/pack). To minimize network transfer, Git will
ideally try to find the packfile that contains almost all the
requested objects and repack it to include or prune objects before
sending it over the wire. End-users don't need to specify any
objects; they only ever need to sync their local refs (in
.git/refs/heads) with the corresponding remote refs (in
.git/refs/remotes). The painful details of the actual
transfer are outlined in another ProGit chapter.
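The idea behind deltifying an object can be sketched as a list of "copy this range from the base" and "insert these literal bytes" instructions. The Python below is only an illustration of that idea: it is not Git's binary delta encoding, it uses difflib from the standard library to find matching ranges (a swapped-in technique, chosen for brevity), and the names make_delta and apply_delta are hypothetical.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copy-from-base and insert-literal instructions."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))  # offset and length in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # bytes not found in base
        # "delete" needs no instruction: those base bytes are simply skipped
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from the base object and the instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


base = b"".join(b"line %d\n" % i for i in range(50))
target = base.replace(b"line 25\n", b"line twenty-five\n") + b"line 50\n"

delta = make_delta(base, target)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
rebuilt = apply_delta(base, delta)
```

Only the literal bytes and a handful of small copy instructions need to be stored or sent; the rest is recovered from an object the receiver already has.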
So far, we've covered how Git keeps revision history and transports it between two machines. The rest is infrastructure plus "porcelain" commands for end-users. The index, log, merge, blame, bisect, rebase, and filter-branch are just some of the more advertised features. Here's an outline of what's possible within the current infrastructure:
- Rewrite one commit somewhere in the history and automatically re-create all the dependent commits, asking for user intervention only in the case of a conflict. This is what git rebase essentially does.
- Replace one commit with another in-place without rewriting any other commits. See git replace.
- Chop the revision history somewhere in the middle and make it look like all the commits before that point never existed. This operation is called a graft.
- Given a file located in a certain directory in the current revision, it's possible to move the file to another directory and rewrite the history to make it look like it was always present there. See the index-filter in git filter-branch.
- Given the revision histories of two completely unrelated projects, it's possible to rewrite the history of one of them to make it look like the other project was developed in a subdirectory. See subtree merge.
- Given a line range in a file in any revision, Git can tell how it got there (taking into account how the lines might have been moved around by other commits), and where it went in the current revision. This isn't possible yet, but git log -L is a step in this direction.
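Several of these operations re-create commits rather than edit them, and a toy model shows why. In the sketch below a commit id is the hash of the parent id plus the message; real commits also cover the tree, author, and dates, so this is a deliberate simplification, and commit_id and build_chain are hypothetical helpers.

```python
import hashlib


def commit_id(parent, message):
    """Toy commit id: hash of the parent id plus the message."""
    return hashlib.sha1(f"parent {parent}\n\n{message}".encode()).hexdigest()


def build_chain(messages):
    """Create a linear history and return the list of commit ids."""
    ids, parent = [], None
    for message in messages:
        parent = commit_id(parent, message)
        ids.append(parent)
    return ids


original = build_chain(["add README", "add mylib", "add tricks", "fix typo"])
# Rewrite only the second commit's message and replay the rest on top of it.
rewritten = build_chain(["add README", "add mylib.rb", "add tricks", "fix typo"])

changed = [old != new for old, new in zip(original, rewritten)]
print(changed)  # [False, True, True, True]
```

Because each id covers its parent's id, changing one commit gives every descendant a new id, while everything before it stays untouched. That is why rewriting history means re-creating the dependent commits.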