Highlights from Git 2.44

The open source Git project just released Git 2.44 with features and bug fixes from over 85 contributors, 34 of them new. We last caught up with you on the latest in Git back when 2.43 was released.

To celebrate this most recent release, here is GitHub’s look at some of the most interesting features and changes introduced since last time.

Faster pack generation with multi-pack reuse

If you’ve ever looked closely at Git’s output when pushing or pulling a repository to/from GitHub[1], you might have noticed the pack-reused number that appears at the end of your output, like so:

$ git clone git@github.com:git/git.git
Cloning into 'git'...
remote: Enumerating objects: 361232, done.
remote: Counting objects: 100% (942/942), done.
remote: Compressing objects: 100% (453/453), done.
remote: Total 361232 (delta 598), reused 773 (delta 487), pack-reused 360290
[...]

If you’ve ever looked at that number (above, this is pack-reused 360290), and wondered what it meant, then look no further!

In general terms, that number refers to how much of the pack GitHub was able to send by (more-or-less) streaming verbatim sections of a pack that already exists down to the cloner, instead of generating a new pack on the fly. When Git is sending objects to the client (when fetching/cloning), to the server (when pushing), or to itself (when repacking), Git needs to generate a packfile that contains the set of objects being transferred. For many of the objects in this pack, Git will locate those objects, open and parse them, then optionally try and pair them with some existing object to form a delta chain.

Repeating this process over all objects in the pack yields a more compact result, since Git will find and pair objects that have similar content to save space. When pushing a small amount of data to GitHub, this search usually takes a negligible amount of time. But during a clone, loading and trying to re-delta-ify all of the reachable objects in a repository can become prohibitively expensive, especially when carried out over tens of thousands of clones or more.

To save time, Git takes a shortcut: because the wire format used to transfer objects uses the same representation as the .pack files on disk (in $GIT_DIR/objects/pack), it can reuse sections of an existing packfile byte-for-byte when generating the new pack to send down to the client.

In our above example, that’s exactly what happened: the pack-reused 360290 portion of our output indicated that GitHub was able to reuse 360,290 objects from disk without having to re-open and search for new deltas. That process was carried out only over the remaining objects (in this case, 361,232 less the reused quantity leaves 942 objects that took the slow path, the same 942 reported on the Counting objects line).
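
The subtraction above can be checked directly against the sample output. This is only a sketch: the variable names are made up for illustration, and the line is copied from the clone output shown earlier.

```shell
# Summary line copied from the sample clone output above.
line='remote: Total 361232 (delta 598), reused 773 (delta 487), pack-reused 360290'

# Extract the total and the pack-reused counts.
total=$(printf '%s\n' "$line" | sed -n 's/.*Total \([0-9][0-9]*\).*/\1/p')
reused=$(printf '%s\n' "$line" | sed -n 's/.*pack-reused \([0-9][0-9]*\).*/\1/p')

# Whatever was not reused verbatim had to go through the slow path.
slow_path=$((total - reused))
echo "objects that took the slow path: $slow_path"
# prints: objects that took the slow path: 942
```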

Verbatim pack-reuse sounds like a great deal, right? It is, but a couple of gotchas restrict how often Git can make use of this optimization:

  • Packfiles cannot contain the same object more than once. For single-pack reuse, this is easy enough (since the pack we’re reusing from also can’t contain duplicate copies of an object), but it makes implementing multi-pack reuse difficult.
  • Certain kinds of deltas (which identify their base by the number of bytes between the delta and base) need to be “patched” if there is an omitted section between the delta and its base, changing the offset.

In order to take full advantage of verbatim pack-reuse, a repository needs to have a majority of its objects packed together in a single packfile. For many repositories, this isn’t a huge deal, but it can become prohibitively expensive for large repositories with many hundreds of millions of objects.
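
If you’re curious how many packs a repository currently has, git count-objects -v reports it, and git repack -a -d is the traditional way to consolidate everything into one pack. Here is a minimal sketch in a throwaway repository; the file name and identity are placeholders, and the loop exists only to produce more than one pack.

```shell
# Work in a scratch repository so nothing real is touched.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

# Each iteration packs only the new loose objects, leaving one pack per commit.
for i in 1 2; do
  echo "change $i" > file.txt
  git add file.txt
  git commit -q -m "commit $i"
  git repack -q -d
done
packs_before=$(git count-objects -v | sed -n 's/^packs: //p')

# Rewrite everything into a single pack and drop the now-redundant ones.
git repack -q -a -d
packs_after=$(git count-objects -v | sed -n 's/^packs: //p')

echo "packs before: $packs_before, after: $packs_after"
```

On a tiny repository this is instant; the cost the paragraph above describes only shows up when the all-into-one repack has to rewrite a very large number of objects.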

Git 2.44 ships with new support for reusing objects across multiple packs. When using a multi-pack index with reachability bitmaps (for more about these, check out our post, Scaling monorepo maintenance), Git can now take advantage of this optimization across multiple packs, eliminating the need to repack your repository into a single pack.

We’ll cover the precise details in a future blog post dedicated to multi-pack reuse. For now, you might notice a new line of output in your terminal the next time you push to GitHub:

$ git push
Enumerating objects: 350175, done.
Counting objects: 100% (832/832), done.
Compressing objects: 100% (132/132), done.
Total 350175 (delta 735), reused 700 (delta 700), pack-reused 349343 (from 36)
[...]

Notice that instead of just pack-reused, we get an extra piece of information next to it, (from 36), indicating the number of packs from which objects were reused.

To try this out yourself, upgrade your local installation of Git, and run

$ git config --global pack.allowPackReuse multi
$ git multi-pack-index write --bitmap

before the next time you push to GitHub.

[source]

Faster rebases (and much more) with git replay

If you’ve read this series before, you’re no doubt familiar with our coverage of merge-ort, a recent development in Git that is a from-scratch rewrite of the merging backend. If you’re a newcomer to this series (first of all, welcome!), our coverage beginning in our Highlights from Git 2.33 is a great place to start.

merge-ort was introduced almost a dozen Git versions ago and aimed to solve several long-standing issues with its predecessor, the recursive backend. The recursive backend was notoriously difficult to modify, and had difficulty performing well when dealing with merges that involve a large number of renames.

The merge-ort backend was introduced to address these issues, by providing a structured implementation that was correct (with respect to the existing behavior, making it a drop-in replacement for the existing backend), performant, and easy to change. In Git 2.34 (for those interested, our coverage begins here), merge-ort became the default merging backend, meaning that if you’re running Git 2.34 or newer and don’t have any special configuration, you’re almost certainly already making use of merge-ort. Modern versions of Git use the merge-ort backend to resolve conflicts between files on either side of a merge or rebase. With merge-ort in place and widely used, merges and rebases could be computed significantly faster.

But merge-ort also makes it possible to compute merges and rebases without requiring that you have a fully populated checkout of your repository. To perform merges, the merge-tree command accepts the --write-tree option, which computes merges with merge-ort without requiring a checked-out version of your repository.
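
As a rough illustration of that last point, the sketch below merges two branches with git merge-tree --write-tree (available in reasonably recent Git versions) and inspects the resulting tree without ever checking it out. The branch and file names are invented for the example.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

# A base commit with two files.
printf 'a\n' > a.txt
printf 'b\n' > b.txt
git add a.txt b.txt
git commit -q -m "base"
base=$(git rev-parse HEAD)

# Two branches that each touch a different file.
git checkout -q -b left "$base"
printf 'a from left\n' > a.txt
git commit -q -am "left edits a.txt"

git checkout -q -b right "$base"
printf 'b from right\n' > b.txt
git commit -q -am "right edits b.txt"

# Compute the merge. For a clean merge this prints the ID of the merged
# tree; the index and working tree are left untouched.
tree=$(git merge-tree --write-tree left right)
git ls-tree "$tree"
```

The merged tree contains both edits, while the working tree still reflects whichever branch happens to be checked out.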

Rebases were a different story. The existing git rebase sub-command comes with a lot of historical design decisions and assumptions that would make integrating it with merge-ort less than straightforward, and that would make it hard to improve performance without breaking backwards compatibility guarantees[2].

git replay exists to address these challenges. It offers an alternative to git rebase that, in addition to being far more performant:

  • Can operate in bare repositories.
  • Can rebase branches other than the currently checked-out one (in non-bare repositories).
  • Can operate over multiple branches simultaneously.

and much more. GitHub has been using merge-ort for more than a year to power all merges (and more recently, all rebases) performed on GitHub.com, and it has brought substantial performance improvements to both operations.

You might find git replay useful if you’re scripting around in a repository, interested in eking out performance gains relative to git rebase, or are just interested in playing around with the latest and greatest developments in the Git project. Regardless of which camp you’re in, you can learn more about git replay here.

[source]
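
If you want to kick the tires, here is a small sketch of rebasing a branch that is not checked out. It assumes a Git new enough to ship git replay (2.44 or later), and the branch and file names are invented. In Git 2.44 the command prints ref updates in the format that git update-ref --stdin expects rather than moving any refs itself, which is why its output is piped along.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

printf 'a\n' > a.txt
printf 'b\n' > b.txt
git add a.txt b.txt
git commit -q -m "base"
git branch -M main

# A topic branch forks from the first commit...
git checkout -q -b topic
printf 'b from topic\n' > b.txt
git commit -q -am "topic edits b.txt"

# ...while main moves ahead. We stay on main, so topic is not checked out.
git checkout -q main
printf 'a from main\n' > a.txt
git commit -q -am "main edits a.txt"

# Replay topic's commits on top of main and apply the resulting ref update.
git replay --onto main main..topic | git update-ref --stdin

git log --oneline --graph main topic
```

Afterwards, topic sits directly on top of main even though the working tree never left main.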


  • While we’re on the topic of rebases, let’s talk about --autosquash. In case you’ve never used that option before, don’t worry; here’s a quick introduction. When rebasing, Git will try to combine commits whose subject line begins with fixup! [...], squash! [...], or amend! [...], where the [...] is the log message of some other commit. Git will pair these up and reorganize the todo list to put the fixup! [...] commits (etc.) next to their non-fixup! counterparts.

    Depending on the verb, Git will either combine changes, alter the commit message, or merge successive commit messages together, allowing you to easily edit your work.

    However, previous versions of Git only provided functionality for these options when using interactive rebases with git rebase --interactive (or just git rebase -i, for short). If you wrote a fixup! commit (or similar) and wanted to quickly apply it at the right spot in history, you’d have to either: (a) run git rebase -i and close your $EDITOR, or (b) run GIT_SEQUENCE_EDITOR=true git rebase -i.

    In Git 2.44, autosquashing also works with non-interactive rebases, meaning that you can run git rebase --autosquash and have your fixup! commits applied in their respective locations without having to inspect the todo list or munge your GIT_SEQUENCE_EDITOR environment variable.

    [source]

  • If you’ve been using Git for a long time (or are a newcomer), you’ve probably seen a message beginning with hint:, like so:

    hint: Updates were rejected because the tag already exists in the remote.
    hint: Disable this message with "git config advice.pushAlreadyExists false"

    Like the hint suggests, you can run git config advice.pushAlreadyExists false to tell Git to avoid showing you the message. But what if you find the advice useful? Perhaps you want to be warned (for example) when attempting to push a tag without --force to a remote which already has a tag by that same name. When that’s the case, you likely don’t want to also see the “Disable this message with […]” portion of the hint.

    In Git 2.44, you can now run git config advice.pushAlreadyExists true to indicate that you want to receive that hint. Git will continue to show it to you, while suppressing the “Disable this message with […]” portion of the message.

    [source]

  • Quick quiz: what does the --no-sort option do when given to git for-each-ref? If you thought, “surely it doesn’t list all references in a non-alphabetical order,” then congratulations, you’re a veteran Git user!

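    Despite its name, --no-sort in earlier versions still produced the output of git for-each-ref in sorted order, so it could not take advantage of certain optimizations that assume an arbitrary ordering. In Git 2.44, --no-sort leaves the output order unspecified, which makes those optimizations possible.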

    For those interested in the technical details, you can learn more in the patches linked below. If you just want the numbers, you’re in luck: on my machine, git for-each-ref --no-sort outperforms a bog-standard git for-each-ref by more than 20% on a repository with a large number of references.

    [source]

  • If you’ve spent much time perusing the Git documentation, you’ve likely encountered the term “pathspec”, and perhaps wondered what it meant. In Git parlance, “pathspec” roughly corresponds to “ways to limit filepaths” when used in conjunction with a Git command.

    There are lots of examples in the documentation, but some notable ones include: git show ':^Documentation/' (meaning, “show me the last commit, excluding any changes in the Documentation directory”), git show ':(icase)**/*sha256*' (meaning, “show me files with ‘sha256’ in their path, regardless of casing”), and git show ':(attr:!binary)' (meaning, “show me files which do not have their binary attribute set via .gitattributes”).

    In Git 2.44, git add now understands the attr pathspec magic, meaning that you can do things like git add ':(attr:!binary)' to stage all text/non-binary files in the index.

    Git 2.44 also introduces a new built-in attribute, called builtin_objectmode, for use with the attr pathspec magic. It allows filtering paths by their mode (for example, 100644 for non-executable files, 100755 for executable ones, 160000 for submodules, etc.). The builtin_ prefix indicates that you can use it without needing to set any values in your .gitattributes file(s), meaning that you can do things like git add ':(attr:builtin_objectmode=100755)' to add all executable files in your working copy.

    [source, source]
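
To make the pathspec item above a little more concrete, here is a sketch that uses git ls-files to preview which paths an attr pathspec selects. The file names are invented. As far as the pathspec documentation goes, !binary matches paths where the attribute is unspecified, while -binary would match paths where it is explicitly unset. The builtin_objectmode query is expected to return results only on Git 2.44 or newer.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .

# Mark *.bin as binary; every other path leaves the attribute unspecified.
printf '*.bin binary\n' > .gitattributes
printf 'notes\n' > notes.txt
printf 'blob\n' > data.bin
printf '#!/bin/sh\necho hi\n' > run.sh
chmod +x run.sh
git add .

# Paths whose binary attribute is unspecified.
text_paths=$(git ls-files ':(attr:!binary)')
echo "$text_paths"

# Paths recorded with mode 100755 (Git 2.44 or newer).
git ls-files ':(attr:builtin_objectmode=100755)'
```

Swapping git ls-files for git add in either command stages the same set of paths on Git 2.44.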

The whole shebang

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.44, or any previous version in the Git repository.

Notes


  1. If you’re reading this blog post (especially the footnotes!) there’s a pretty good chance that you have.
  2. For those curious, an extensive discussion on why git replay was introduced instead of extending git rebase can be found on the mailing list here.

The post Highlights from Git 2.44 appeared first on The GitHub Blog.

The first Git release of 2024 is here! Take a look at some of our highlights on what's new in Git 2.44.
