François Pinard's site

Git pack compression

Why did I want to know more about object access in packs?

La compression de Git exige qu'il découvre des fichiers très semblables quoique différents, il est probable que les algorithmes de compression et les algorithmes de recherche de similitude ont du code en commun. Git construit optionnellement un pack qui, dans un fichier unique, contient une foule d'objets originalement dans des fichiers séparés.

J'avais lu que les objets de Git (fichiers, répertoires, tags, commits) étaient comprimés avec zlib dans un pack. Hors, zlib utilise un algorithme adaptatif (en clair: les tables de décompression sont construites dynamiquement pendant la décompression séquentielle, à partir même du contenu du fichier — et donc, il n'est pas possible de commencer à décomprimer sans avoir tout lu depuis le début). Comme Git est super-efficace pour aller chercher n'importe quel objet d'un pack, je ne comprenais pas comment ça pouvait marcher…

For creating a pack, all objects are first sorted, using the object type as the first key, some of its basename as the second key, and decreasing size as the third key. Then a sliding window is created over these, and (using efficient trickery) all deltas between all objects within the sliding window are computed, and smallest deltas are retained for each object, as well as the object with which this delta is possible (this relation is bidirectional). Deltas too big for being efficient are discarded.

Then, the directionality is decided in such a way that most recent objects (that is, those likely to be needed first in other algorithms) are output first and whole, and deltas are used for deriving other objects. All objects are compressed separately, there is no zlib adaptation from an object to another. There are trickeries within the index as well, but deep down, the index directly points to objects and decompression starts from there. In a word, Git deltas do part of the compression, as zlib is not meant to be especially good on short files. The set of chosen packing algorithms is robust to wrong decisions, it will work anyway, just slower and bigger. There are a lot of heuristics built-in and, on the wide average, it all appears to work pretty well.

git:// is so superior to http:// because the algorithm is bidirectional at start, the on-the-fly packing which occurs at the server spares the retaining of information that it knows the receiver already has.