diff mbox series

[v2,03/16] builtin/update-ref: skip ambiguity checks when parsing object IDs

Message ID 20250219-pks-update-ref-optimization-v2-3-e696e7220b22@pks.im (mailing list archive)
State New
Headers show
Series refs: batch refname availability checks | expand

Commit Message

Patrick Steinhardt Feb. 19, 2025, 1:23 p.m. UTC
Most of the commands in git-update-ref(1) accept an old and/or new
object ID to update a specific reference to. These object IDs get parsed
via `repo_get_oid()`, which not only handles plain object IDs, but also
those that have a suffix like "~" or "^2". More surprisingly though, it
even knows to resolve references, despite the fact that its manpage does
not mention this fact even once.

One consequence of this is that we also check for ambiguous references:
when parsing a full object ID where the DWIM mechanism would also cause
us to resolve it as a branch, we'd end up printing a warning. While this
check makes sense to have in general, it is arguably less useful in the
context of git-update-ref(1). This is out of two reasons:

  - The manpage is explicitly structured around object IDs. So if we see
    a fully blown object ID, the intent should be quite clear in
    general.

  - The command is part of our plumbing layer and not a tool that users
    would generally use in interactive workflows. As such, the warning
    will likely not be visible to anybody in the first place.

Furthermore, this check can be quite expensive when updating lots of
references via `--stdin`, because we try to read multiple references per
object ID that we parse according to the DWIM rules. This effect can be
seen both with the "files" and "reftable" backend.

The issue is not unique to git-update-ref(1), but was also an issue in
git-cat-file(1), where it was addressed by disabling the ambiguity check
in 25fba78d36b (cat-file: disable object/refname ambiguity check for
batch mode, 2013-07-12).

Disable the warning in git-update-ref(1), which provides a significant
speedup with both backends. The following benchmark creates 10000 new
references with a 100000 preexisting refs with the "files" backend:

    Benchmark 1: update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = HEAD~)
      Time (mean ± σ):     467.3 ms ±   5.1 ms    [User: 100.0 ms, System: 365.1 ms]
      Range (min … max):   461.9 ms … 479.3 ms    10 runs

    Benchmark 2: update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = HEAD)
      Time (mean ± σ):     394.1 ms ±   5.8 ms    [User: 63.3 ms, System: 327.6 ms]
      Range (min … max):   384.9 ms … 405.7 ms    10 runs

    Summary
      update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = HEAD) ran
        1.19 ± 0.02 times faster than update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = HEAD~)

And with the "reftable" backend:

    Benchmark 1: update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = HEAD~)
      Time (mean ± σ):     146.9 ms ±   2.2 ms    [User: 90.4 ms, System: 56.0 ms]
      Range (min … max):   142.7 ms … 150.8 ms    19 runs

    Benchmark 2: update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = HEAD)
      Time (mean ± σ):      63.2 ms ±   1.1 ms    [User: 41.0 ms, System: 21.8 ms]
      Range (min … max):    61.1 ms …  66.6 ms    41 runs

    Summary
      update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = HEAD) ran
        2.32 ± 0.05 times faster than update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = HEAD~)

Note that the absolute improvement with both backends is roughly in the
same ballpark, but the relative improvement for the "reftable" backend
is more significant because writing the new table to disk is faster in
the first place.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/update-ref.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

Comments

Justin Tobler Feb. 19, 2025, 6:21 p.m. UTC | #1
On 25/02/19 02:23PM, Patrick Steinhardt wrote:
> Most of the commands in git-update-ref(1) accept an old and/or new
> object ID to update a specific reference to. These object IDs get parsed
> via `repo_get_oid()`, which not only handles plain object IDs, but also
> those that have a suffix like "~" or "^2". More surprisingly though, it
> even knows to resolve references, despite the fact that its manpage does
> not mention this fact even once.
> 
> One consequence of this is that we also check for ambiguous references:
> when parsing a full object ID where the DWIM mechanism would also cause
> us to resolve it as a branch, we'd end up printing a warning. While this
> check makes sense to have in general, it is arguably less useful in the
> context of git-update-ref(1). This is out of two reasons:
> 
>   - The manpage is explicitly structured around object IDs. So if we see
>     a fully blown object ID, the intent should be quite clear in
>     general.

Makes sense.

>   - The command is part of our plumbing layer and not a tool that users
>     would generally use in interactive workflows. As such, the warning
>     will likely not be visible to anybody in the first place.

Ok, so in many cases already the warning is not propagated which makes
its computation wasteful to begin with.

> Furthermore, this check can be quite expensive when updating lots of
> references via `--stdin`, because we try to read multiple references per
> object ID that we parse according to the DWIM rules. This effect can be
> seen both with the "files" and "reftable" backend.
> 
> The issue is not unique to git-update-ref(1), but was also an issue in
> git-cat-file(1), where it was addressed by disabling the ambiguity check
> in 25fba78d36b (cat-file: disable object/refname ambiguity check for
> batch mode, 2013-07-12).
> 
> Disable the warning in git-update-ref(1), which provides a significant
> speedup with both backends. The following benchmark creates 10000 new
> references with a 100000 preexisting refs with the "files" backend:
> 
>     Benchmark 1: update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = HEAD~)
>       Time (mean ± σ):     467.3 ms ±   5.1 ms    [User: 100.0 ms, System: 365.1 ms]
>       Range (min … max):   461.9 ms … 479.3 ms    10 runs
> 
>     Benchmark 2: update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = HEAD)
>       Time (mean ± σ):     394.1 ms ±   5.8 ms    [User: 63.3 ms, System: 327.6 ms]
>       Range (min … max):   384.9 ms … 405.7 ms    10 runs
> 
>     Summary
>       update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = HEAD) ran
>         1.19 ± 0.02 times faster than update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = HEAD~)
> 
> And with the "reftable" backend:
> 
>     Benchmark 1: update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = HEAD~)
>       Time (mean ± σ):     146.9 ms ±   2.2 ms    [User: 90.4 ms, System: 56.0 ms]
>       Range (min … max):   142.7 ms … 150.8 ms    19 runs
> 
>     Benchmark 2: update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = HEAD)
>       Time (mean ± σ):      63.2 ms ±   1.1 ms    [User: 41.0 ms, System: 21.8 ms]
>       Range (min … max):    61.1 ms …  66.6 ms    41 runs
> 
>     Summary
>       update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = HEAD) ran
>         2.32 ± 0.05 times faster than update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = HEAD~)
> 
> Note that the absolute improvement with both backends is roughly in the
> same ballpark, but the relative improvement for the "reftable" backend
> is more significant because writing the new table to disk is faster in
> the first place.
> 
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  builtin/update-ref.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/builtin/update-ref.c b/builtin/update-ref.c
> index 4d35bdc4b4b..d603f54b770 100644
> --- a/builtin/update-ref.c
> +++ b/builtin/update-ref.c
> @@ -179,7 +179,8 @@ static int parse_next_oid(const char **next, const char *end,
>  		(*next)++;
>  		*next = parse_arg(*next, &arg);
>  		if (arg.len) {
> -			if (repo_get_oid(the_repository, arg.buf, oid))
> +			if (repo_get_oid_with_flags(the_repository, arg.buf, oid,
> +						    GET_OID_SKIP_AMBIGUITY_CHECK))
>  				goto invalid;
>  		} else {
>  			/* Without -z, an empty value means all zeros: */
> @@ -197,7 +198,8 @@ static int parse_next_oid(const char **next, const char *end,
>  		*next += arg.len;
>  
>  		if (arg.len) {
> -			if (repo_get_oid(the_repository, arg.buf, oid))
> +			if (repo_get_oid_with_flags(the_repository, arg.buf, oid,
> +						    GET_OID_SKIP_AMBIGUITY_CHECK))
>  				goto invalid;
>  		} else if (flags & PARSE_SHA1_ALLOW_EMPTY) {
>  			/* With -z, treat an empty value as all zeros: */
> @@ -772,7 +774,8 @@ int cmd_update_ref(int argc,
>  		refname = argv[0];
>  		value = argv[1];
>  		oldval = argv[2];
> -		if (repo_get_oid(the_repository, value, &oid))
> +		if (repo_get_oid_with_flags(the_repository, value, &oid,
> +					    GET_OID_SKIP_AMBIGUITY_CHECK))
>  			die("%s: not a valid SHA1", value);
>  	}
>  
> @@ -783,7 +786,8 @@ int cmd_update_ref(int argc,
>  			 * must not already exist:
>  			 */
>  			oidclr(&oldoid, the_repository->hash_algo);
> -		else if (repo_get_oid(the_repository, oldval, &oldoid))
> +		else if (repo_get_oid_with_flags(the_repository, oldval, &oldoid,
> +						 GET_OID_SKIP_AMBIGUITY_CHECK))
>  			die("%s: not a valid old SHA1", oldval);
>  	}

In builtin/update-ref.c all uses of repo_get_oid() have been converted
to repo_get_oid_with_flags() with the GET_OID_SKIP_AMBIGUITY_CHECK flag
except for one in parse_cmd_symref_update(). Is there reason to leave
that one untouched?

-Justin
Patrick Steinhardt Feb. 20, 2025, 8:05 a.m. UTC | #2
On Wed, Feb 19, 2025 at 12:21:44PM -0600, Justin Tobler wrote:
> > diff --git a/builtin/update-ref.c b/builtin/update-ref.c
> > index 4d35bdc4b4b..d603f54b770 100644
> > --- a/builtin/update-ref.c
> > +++ b/builtin/update-ref.c
> > @@ -179,7 +179,8 @@ static int parse_next_oid(const char **next, const char *end,
> > @@ -783,7 +786,8 @@ int cmd_update_ref(int argc,
> >  			 * must not already exist:
> >  			 */
> >  			oidclr(&oldoid, the_repository->hash_algo);
> > -		else if (repo_get_oid(the_repository, oldval, &oldoid))
> > +		else if (repo_get_oid_with_flags(the_repository, oldval, &oldoid,
> > +						 GET_OID_SKIP_AMBIGUITY_CHECK))
> >  			die("%s: not a valid old SHA1", oldval);
> >  	}
> 
> In builtin/update-ref.c all uses of repo_get_oid() have been converted
> to repo_get_oid_with_flags() with the GET_OID_SKIP_AMBIGUITY_CHECK flag
> except for one in parse_cmd_symref_update(). Is there reason to leave
> that one untouched?

Ah, no, this was a mere oversight. Good catch, fixed.

Patrick
diff mbox series

Patch

diff --git a/builtin/update-ref.c b/builtin/update-ref.c
index 4d35bdc4b4b..d603f54b770 100644
--- a/builtin/update-ref.c
+++ b/builtin/update-ref.c
@@ -179,7 +179,8 @@  static int parse_next_oid(const char **next, const char *end,
 		(*next)++;
 		*next = parse_arg(*next, &arg);
 		if (arg.len) {
-			if (repo_get_oid(the_repository, arg.buf, oid))
+			if (repo_get_oid_with_flags(the_repository, arg.buf, oid,
+						    GET_OID_SKIP_AMBIGUITY_CHECK))
 				goto invalid;
 		} else {
 			/* Without -z, an empty value means all zeros: */
@@ -197,7 +198,8 @@  static int parse_next_oid(const char **next, const char *end,
 		*next += arg.len;
 
 		if (arg.len) {
-			if (repo_get_oid(the_repository, arg.buf, oid))
+			if (repo_get_oid_with_flags(the_repository, arg.buf, oid,
+						    GET_OID_SKIP_AMBIGUITY_CHECK))
 				goto invalid;
 		} else if (flags & PARSE_SHA1_ALLOW_EMPTY) {
 			/* With -z, treat an empty value as all zeros: */
@@ -772,7 +774,8 @@  int cmd_update_ref(int argc,
 		refname = argv[0];
 		value = argv[1];
 		oldval = argv[2];
-		if (repo_get_oid(the_repository, value, &oid))
+		if (repo_get_oid_with_flags(the_repository, value, &oid,
+					    GET_OID_SKIP_AMBIGUITY_CHECK))
 			die("%s: not a valid SHA1", value);
 	}
 
@@ -783,7 +786,8 @@  int cmd_update_ref(int argc,
 			 * must not already exist:
 			 */
 			oidclr(&oldoid, the_repository->hash_algo);
-		else if (repo_get_oid(the_repository, oldval, &oldoid))
+		else if (repo_get_oid_with_flags(the_repository, oldval, &oldoid,
+						 GET_OID_SKIP_AMBIGUITY_CHECK))
 			die("%s: not a valid old SHA1", oldval);
 	}