[00/14] refs: batch refname availability checks

Message ID	20250217-pks-update-ref-optimization-v1-0-a2b6d87a24af@pks.im (mailing list archive)
Headers	show Received: from fhigh-b1-smtp.messagingengine.com (fhigh-b1-smtp.messagingengine.com [202.12.124.152]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3E808231A55 for <git@vger.kernel.org>; Mon, 17 Feb 2025 15:50:26 +0000 (UTC) Feedback-ID: i197146af:Fastmail From: Patrick Steinhardt <ps@pks.im> Subject: [PATCH 00/14] refs: batch refname availability checks Date: Mon, 17 Feb 2025 16:50:14 +0100 Message-Id: <20250217-pks-update-ref-optimization-v1-0-a2b6d87a24af@pks.im> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 8bit To: git@vger.kernel.org Cc: Karthik Nayak <karthik.188@gmail.com>, "brian m. carlson" <sandals@crustytoothpaste.net>, Jeff King <peff@peff.net>, Junio C Hamano <gitster@pobox.com>, Christian Couder <chriscool@tuxfamily.org>
Series	refs: batch refname availability checks \| expand [00/14] refs: batch refname availability checks [01/14] object-name: introduce `repo_get_oid_with_flags()` [02/14] object-name: allow skipping ambiguity checks in `get_oid()` family [03/14] builtin/update-ref: skip ambiguity checks when parsing object IDs [04/14] refs: introduce function to batch refname availability checks [05/14] refs/reftable: start using `refs_verify_refnames_available()` [06/14] refs: stop re-verifying common prefixes for availability [07/14] refs/iterator: separate lifecycle from iteration [08/14] refs/iterator: provide infrastructure to re-seek iterators [09/14] refs/iterator: implement seeking for merged iterators [10/14] refs/iterator: implement seeking for reftable iterators [11/14] refs/iterator: implement seeking for ref-cache iterators [12/14] refs/iterator: implement seeking for `packed-ref` iterators [13/14] refs/iterator: implement seeking for "files" iterators [14/14] refs: reuse iterators when determining refname availability

Message ID

20250217-pks-update-ref-optimization-v1-0-a2b6d87a24af@pks.im (mailing list archive)

Headers

Feedback-ID: i197146af:Fastmail
From: Patrick Steinhardt <ps@pks.im>
Subject: [PATCH 00/14] refs: batch refname availability checks
Date: Mon, 17 Feb 2025 16:50:14 +0100
Message-Id: <20250217-pks-update-ref-optimization-v1-0-a2b6d87a24af@pks.im>
Precedence: bulk
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
To: git@vger.kernel.org
Cc: Karthik Nayak <karthik.188@gmail.com>,
 "brian m. carlson" <sandals@crustytoothpaste.net>,
 Jeff King <peff@peff.net>, Junio C Hamano <gitster@pobox.com>,
 Christian Couder <chriscool@tuxfamily.org>

Series

refs: batch refname availability checks | expand

Message

Patrick Steinhardt Feb. 17, 2025, 3:50 p.m. UTC

Hi,

this patch series has been inspired by brian's report that the reftable
backend is significantly slower when writing many references compared to
the files backend. As explained in that thread, the underlying issue is
the design of tombstone references: when we first delete all references
in a repository and then recreate them, we still have all the tombstones
and thus we need to churn through all of them to figure out that they
have been deleted in the first place. The files backend does not have
this issue.

I consider the benchmark itself to be kind of broken, as it stems from
us deleting all refs and then recreating them. And if you pack refs in
between then the "reftable" backend outperforms the "files" backend.

But there are a couple of opportunities here anyway. While we cannot
make the underlying issue of tombstones being less efficient go away,
this has prompted me to have a deeper look at where we spend all the
time. There are three ideas in this series:

  - git-update-ref(1) performs ambiguity checks for any full-size object
    ID, which triggers a lot of reads. This is somewhat pointless though
    given that the manpage explicitly points out that the command is
    about object IDs, even though it does know to parse refs. But being
    part of plumbing, emitting the warning here does not make a ton of
    sense, and favoring object IDs over references in these cases is the
    obvious thing to do anyway.

  - For each ref "refs/heads/bar", we need to verify that neither
    "refs/heads" nor "refs" exists. This was repeated for every refname,
    but because most refnames use common prefixes this made us re-check
    a lot of prefixes. This is addressed by using a `strset` of already
    checked prefixes.

  - For each ref "refs/heads/bar", we need to verify that no ref
    "refs/heads/bar/*" exists. We always created a new ref iterator for
    this check, which requires us to discard all internal state and then
    recreate it. The reftable library has already been refactored though
    to have reseekable iterators, so we backfill this functionality to
    all the other iterators and then reuse the iterator.

With the (somewhat broken) benchmark we small speedup with the "files"
backend:

    Benchmark 1: update-ref (refformat = files, revision = master)
      Time (mean ± σ):     233.8 ms ±   1.4 ms    [User: 81.2 ms, System: 151.5 ms]
      Range (min … max):   232.2 ms … 236.0 ms    10 runs

    Benchmark 2: update-ref (refformat = files, revision = HEAD)
      Time (mean ± σ):     192.3 ms ±   1.5 ms    [User: 67.2 ms, System: 124.0 ms]
      Range (min … max):   190.1 ms … 194.2 ms    10 runs

    Summary
      update-ref (refformat = files, revision = HEAD) ran
        1.22 ± 0.01 times faster than update-ref (refformat = files, revision = master)

And a huge speedup with the "reftable" backend:

    Benchmark 1: update-ref (refformat = reftable, revision = master)
      Time (mean ± σ):     16.852 s ±  0.061 s    [User: 16.754 s, System: 0.059 s]
      Range (min … max):   16.785 s … 16.982 s    10 runs

    Benchmark 2: update-ref (refformat = reftable, revision = HEAD)
      Time (mean ± σ):      2.230 s ±  0.009 s    [User: 2.192 s, System: 0.029 s]
      Range (min … max):    2.215 s …  2.244 s    10 runs

    Summary
      update-ref (refformat = reftable, revision = HEAD) ran
        7.56 ± 0.04 times faster than update-ref (refformat = reftable, revision = master)

We're still not up to speed with the "files" backend, but considerably
better. Given that this is an extreme edge case and not reflective of
the general case I'm okay with this result for now.

But more importantly, this refactoring also has a positive effect when
updating references in a repository with preexisting refs, which I
consider to be the more realistic scenario. The following benchmark
creates 10k refs with 100k preexisting refs.

With the "files" backend we see a modest improvement:

    Benchmark 1: update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = master)
      Time (mean ± σ):     470.1 ms ±   5.4 ms    [User: 104.5 ms, System: 363.1 ms]
      Range (min … max):   465.7 ms … 484.3 ms    10 runs

    Benchmark 2: update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = HEAD)
      Time (mean ± σ):     407.8 ms ±   5.4 ms    [User: 66.0 ms, System: 340.0 ms]
      Range (min … max):   399.9 ms … 417.6 ms    10 runs

    Summary
      update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = HEAD) ran
        1.15 ± 0.02 times faster than update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = master)

But with the "reftable" backend we see an almost 5x improvement, where
it's now ~15x faster than the "files" backend:

    Benchmark 1: update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = master)
      Time (mean ± σ):     153.9 ms ±   2.0 ms    [User: 96.5 ms, System: 56.6 ms]
      Range (min … max):   150.5 ms … 158.4 ms    18 runs

    Benchmark 2: update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = HEAD)
      Time (mean ± σ):      32.2 ms ±   1.2 ms    [User: 27.6 ms, System: 4.3 ms]
      Range (min … max):    29.8 ms …  38.6 ms    71 runs

    Summary
      update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = HEAD) ran
        4.78 ± 0.19 times faster than update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = master)

The series is structured as follows:

  - Patches 1 to 4 implement the logic to skip ambiguity checks in
    git-update-ref(1).

  - Patch 5 and 6 introduce batched checks.

  - Patch 7 deduplicates the ref prefix checks.

  - Patch 8 to 14 implement the infrastructure to reseek iterators.

  - Patch 15 starts to reuse iterators for nested ref checks.

Thanks!

Patrick

[1]: <Z602dzQggtDdcgCX@tapette.crustytoothpaste.net>

---
Patrick Steinhardt (14):
      object-name: introduce `repo_get_oid_with_flags()`
      object-name: allow skipping ambiguity checks in `get_oid()` family
      builtin/update-ref: skip ambiguity checks when parsing object IDs
      refs: introduce function to batch refname availability checks
      refs/reftable: start using `refs_verify_refnames_available()`
      refs: stop re-verifying common prefixes for availability
      refs/iterator: separate lifecycle from iteration
      refs/iterator: provide infrastructure to re-seek iterators
      refs/iterator: implement seeking for merged iterators
      refs/iterator: implement seeking for reftable iterators
      refs/iterator: implement seeking for ref-cache iterators
      refs/iterator: implement seeking for `packed-ref` iterators
      refs/iterator: implement seeking for "files" iterators
      refs: reuse iterators when determining refname availability

 builtin/clone.c              |   2 +
 builtin/update-ref.c         |  12 ++-
 dir-iterator.c               |  24 +++---
 dir-iterator.h               |  13 +--
 hash.h                       |   1 +
 object-name.c                |  18 +++--
 object-name.h                |   6 ++
 refs.c                       | 186 ++++++++++++++++++++++++++-----------------
 refs.h                       |  12 +++
 refs/debug.c                 |  20 +++--
 refs/files-backend.c         |  52 ++++++------
 refs/iterator.c              | 145 +++++++++++++++++----------------
 refs/packed-backend.c        |  89 ++++++++++++---------
 refs/ref-cache.c             |  83 +++++++++++--------
 refs/refs-internal.h         |  54 ++++++++-----
 refs/reftable-backend.c      |  85 +++++++++++---------
 t/helper/test-dir-iterator.c |   1 +
 17 files changed, 471 insertions(+), 332 deletions(-)


---
base-commit: e2067b49ecaef9b7f51a17ce251f9207f72ef52d
change-id: 20250217-pks-update-ref-optimization-15c795e66e2b

Comments

brian m. carlson Feb. 18, 2025, 5:10 p.m. UTC | #1

On 2025-02-17 at 15:50:14, Patrick Steinhardt wrote:
> But more importantly, this refactoring also has a positive effect when
> updating references in a repository with preexisting refs, which I
> consider to be the more realistic scenario. The following benchmark
> creates 10k refs with 100k preexisting refs.
> 
> With the "files" backend we see a modest improvement:
> 
>     Benchmark 1: update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = master)
>       Time (mean ± σ):     470.1 ms ±   5.4 ms    [User: 104.5 ms, System: 363.1 ms]
>       Range (min … max):   465.7 ms … 484.3 ms    10 runs
> 
>     Benchmark 2: update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = HEAD)
>       Time (mean ± σ):     407.8 ms ±   5.4 ms    [User: 66.0 ms, System: 340.0 ms]
>       Range (min … max):   399.9 ms … 417.6 ms    10 runs
> 
>     Summary
>       update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = HEAD) ran
>         1.15 ± 0.02 times faster than update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = master)
> 
> But with the "reftable" backend we see an almost 5x improvement, where
> it's now ~15x faster than the "files" backend:
> 
>     Benchmark 1: update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = master)
>       Time (mean ± σ):     153.9 ms ±   2.0 ms    [User: 96.5 ms, System: 56.6 ms]
>       Range (min … max):   150.5 ms … 158.4 ms    18 runs
> 
>     Benchmark 2: update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = HEAD)
>       Time (mean ± σ):      32.2 ms ±   1.2 ms    [User: 27.6 ms, System: 4.3 ms]
>       Range (min … max):    29.8 ms …  38.6 ms    71 runs
> 
>     Summary
>       update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = HEAD) ran
>         4.78 ± 0.19 times faster than update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = master)

I'm glad to see this performance speedup.  That's a really nice
improvement.

> The series is structured as follows:
> 
>   - Patches 1 to 4 implement the logic to skip ambiguity checks in
>     git-update-ref(1).
> 
>   - Patch 5 and 6 introduce batched checks.
> 
>   - Patch 7 deduplicates the ref prefix checks.
> 
>   - Patch 8 to 14 implement the infrastructure to reseek iterators.
> 
>   - Patch 15 starts to reuse iterators for nested ref checks.

I took a look at this series and I didn't find anything that stood out
to me as a problem.  I will say that the reftable code isn't my forte,
so please don't take this as a formal review, but I am definitely
positive on the series in the general sense.