Posted Friday, 27 August 2010

The central idea is simple: Git is not a versioning system at all, at least not in the way CVS, SVN, Perforce, or Mercurial are. This is the reason no versioning system can even come close to Git in terms of features, speed, and compactness of storage.

So what is Git? It's a key-value object store (in .git/objects) plus the infrastructure to play with it. The keys are SHA1 hashes and values can be four kinds of objects: blobs, trees, commits, and tags.
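To make the "key" half of that concrete, here is a minimal sketch of how the key for a blob can be derived. It assumes the standard loose-object convention of hashing a short header (object type, content length, a NUL byte) followed by the raw content; the function name git_blob_key is mine, not something Git exposes.

```python
import hashlib


def git_blob_key(content: bytes) -> str:
    """Return the SHA-1 key for a blob holding `content`.

    The hash covers a header ("blob <size>\\0") plus the raw bytes,
    so the key depends only on the content, never on the file name.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_key(b"test content\n"))
# d670460b4b4aece5915caf5c68d12f560a9fe3e4
```

Running echo 'test content' | git hash-object --stdin in any repository should print the same key.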

Git is often described as a "dumb content tracker". It is dumb because traditional versioning systems use a delta storage system: they use diff algorithms to find the differences between the way your project currently looks and the way it looked in the previous commit, figure out how to represent those differences, and finally store them as compressed deltas. Git, on the other hand, simply stores a complete snapshot of the repository every time a commit is created*; consequently, committing is a very cheap operation in Git. It is a content tracker because it separates the content of the files from the properties of the files and the directory structure.

.
|-- README
`-- lib
    |-- inc
    |   `-- tricks.rb
    `-- mylib.rb

2 directories, 3 files

Inside the object store

Note: The same object is never stored twice. So, if you have a
file that's untouched between five commits, all five commits
will (through their trees) reference the same blob object. More
precisely, a blob isn't a delta; it's the full text.
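A toy illustration of that note, under loud assumptions: ToyObjectStore is a hypothetical in-memory stand-in for .git/objects, not Git's code. It only shows why content-addressing makes "never stored twice" fall out for free.

```python
import hashlib


class ToyObjectStore:
    """Hypothetical in-memory key-value store keyed by content hash."""

    def __init__(self):
        self.objects = {}

    def put(self, kind: str, body: bytes) -> str:
        # Same header-plus-content hashing idea as a loose object.
        data = kind.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\0" + body
        key = hashlib.sha1(data).hexdigest()
        # Identical content maps to the same key, so it is kept only once.
        self.objects.setdefault(key, data)
        return key


store = ToyObjectStore()

# First snapshot: two files.
readme_v1 = store.put("blob", b"hello\n")
lib_v1 = store.put("blob", b"puts 1\n")

# Second snapshot: README untouched, the library file changed.
readme_v2 = store.put("blob", b"hello\n")
lib_v2 = store.put("blob", b"puts 2\n")

print(readme_v1 == readme_v2)  # the untouched file reuses its blob
print(len(store.objects))      # number of distinct objects stored
```

Two full snapshots of two files end up as three stored objects, because the unchanged README hashes to the key it already had.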

The above diagram is what one revision of the repository looks like in the object store. Now imagine many such commit objects connected in a DAG; consequently, when the SHA1 hashes of the tips of the DAG (called heads) are known, every commit is accessible. The end user can't be expected to memorize cryptic SHA1 hashes; that's why refs in .git/refs exist.
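The "know the heads, reach everything" claim can be sketched as a plain graph walk. The commit ids and the refs dictionary below are made up for illustration; real ids are 40-character hashes and real refs are files.

```python
def reachable(commits, heads):
    """Return every commit id reachable from `heads`.

    `commits` maps a commit id to the list of its parent ids.
    """
    seen = set()
    stack = list(heads)
    while stack:
        commit_id = stack.pop()
        if commit_id in seen:
            continue
        seen.add(commit_id)
        stack.extend(commits[commit_id])
    return seen


# Hypothetical history: b forks into c and d, which merge back in e.
commits = {"a": [], "b": ["a"], "c": ["b"], "d": ["b"], "e": ["c", "d"]}

# Refs are just human-friendly names for commit ids.
refs = {"refs/heads/master": "e"}

print(sorted(reachable(commits, refs.values())))
```

A commit that no head can reach is invisible to this walk, which is exactly why a ref pointing at the tip matters.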

Using this model has several ramifications. Let's say there are 100 commits in the repository: I add a commit and slap the ref moo onto it, while you create one and slap the ref foo onto it. Voila! moo and foo are branches. There's one last piece of the puzzle missing: when a command such as git commit is issued, how does Git know which commit in the DAG to link up the new commit object to? In other words, what is the "current" commit or ref I'm on? There's a special symbolic ref .git/HEAD to take care of this; at the level of refs, switching to branch moo is as simple as echo ref: refs/heads/moo > .git/HEAD.
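Here is a rough sketch of how a symbolic HEAD gets resolved to a commit. It assumes every ref lives as a loose file under refs/ (real repositories may also keep refs in a packed-refs file, which this ignores), and it runs against a throwaway directory with fake commit ids. resolve_head and switch_branch are hypothetical helper names.

```python
import os
import tempfile


def resolve_head(git_dir: str) -> str:
    """Follow HEAD to the commit id it currently names."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        value = f.read().strip()
    if value.startswith("ref: "):
        # Symbolic ref: HEAD names another ref, which holds the id.
        ref_name = value[len("ref: "):]
        with open(os.path.join(git_dir, *ref_name.split("/"))) as f:
            return f.read().strip()
    return value  # detached HEAD: the file holds a commit id directly


def switch_branch(git_dir: str, name: str) -> None:
    """Python equivalent of: echo ref: refs/heads/<name> > HEAD"""
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/" + name + "\n")


# Throwaway stand-in for a .git directory with two branches.
git_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(git_dir, "refs", "heads"))
for name, commit_id in (("moo", "1" * 40), ("foo", "2" * 40)):
    with open(os.path.join(git_dir, "refs", "heads", name), "w") as f:
        f.write(commit_id + "\n")

switch_branch(git_dir, "moo")
on_moo = resolve_head(git_dir)
switch_branch(git_dir, "foo")
on_foo = resolve_head(git_dir)
print(on_moo, on_foo)
```

Note that rewriting HEAD this way only changes which ref is "current"; it does nothing to the files in the working directory.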

The most obvious problem that comes to mind is size. If Git doesn't deltify objects, how does it afford network operations? Enter the Git packfile format. Using git pack-objects, git unpack-objects and git repack, it deltifies and packs loose objects (in .git/objects) into packfiles (in .git/objects/pack), each accompanied by an index for locating objects inside the pack. To minimize network transfer, Git will ideally try to find the packfile that contains almost all the requested objects and repack it to include or prune objects before sending it over the wire. End users don't need to specify any objects: they only ever need to sync their local refs (in .git/refs/heads) with the corresponding remote refs (in .git/refs/remotes). The painful details of the actual transfer are outlined in another ProGit chapter.
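To give a feel for what "deltifying" buys, here is a toy delta encoder. This is not Git's packfile format or its delta-selection heuristics. It borrows only the general idea of describing a new object as "copy this range from a base object" and "insert these new bytes" instructions, and it leans on Python's difflib to find the matching ranges.

```python
import difflib


def make_delta(base: bytes, target: bytes):
    """Describe `target` as copy/insert instructions against `base`."""
    ops = []
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # offset and length in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))   # literal new bytes
        # "delete" needs no instruction: those bytes are simply not copied
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild the target from `base` and the instruction list."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


base = b"line one\nline two\nline three\n"
target = b"line one\nline 2\nline three\nline four\n"

ops = make_delta(base, target)
print(ops)
print(apply_delta(base, ops) == target)
```

The literal bytes carried by the insert instructions are all that is new; everything else is expressed as references into an object the receiver already has.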

So far, we've covered how Git keeps revision history and transports it between two machines. The rest is infrastructure plus "porcelain" commands for end users. The index, log, merge, blame, bisect, rebase, and filter-branch are just some of the more advertised features. Here's an outline of what's possible within the current infrastructure:

  1. Rewrite one commit somewhere in the history and automatically re-create all the dependent commits, asking for user intervention only in the case of a conflict. This is what git rebase essentially does.
  2. Replace one commit with another in-place without rewriting any other commits. See git replace.
  3. Chop the revision history somewhere in the middle and make it look like all the commits before that point never existed. This operation is called a graft.
  4. Given a file located in a certain directory in the current revision, it's possible to move the file to another directory and rewrite the history to make it look like it was always present there. See the index-filter in git filter-branch.
  5. Given the revision histories of two completely unrelated projects, it's possible to rewrite the history of one of them to make it look like the other project was developed in a subdirectory. See subtree merge.
  6. Given a line range in a file in any revision, Git can tell how it got there (taking into account how the lines might have been moved around by other commits), and where it went in the current revision. This isn't possible yet, but git log -L is a step in this direction.
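Item 1 is worth a small sketch, because it follows directly from the object model: a commit's key covers its parent's key, so changing one commit forces new keys for everything built on top of it. The commit_id function below is a simplified stand-in that hashes only the parent id and the message; real commit objects also record a tree, an author, and timestamps.

```python
import hashlib


def commit_id(parent_id, message):
    """Toy commit key: a hash over the parent's key and the message."""
    body = "parent " + (parent_id or "none") + "\n\n" + message
    return hashlib.sha1(body.encode("utf-8")).hexdigest()


def build_chain(messages):
    """Create a linear history and return the commit ids, oldest first."""
    ids = []
    parent = None
    for message in messages:
        parent = commit_id(parent, message)
        ids.append(parent)
    return ids


original = build_chain(["add README", "add mylib", "add tricks"])
rewritten = build_chain(["add README", "add mylib (fixed typo)", "add tricks"])

for old, new in zip(original, rewritten):
    print(old[:8], new[:8], "same" if old == new else "re-created")
```

The first commit keeps its id, while the edited commit and its untouched descendant both get new ones. That is the "re-create all the dependent commits" part of rebase in miniature.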
