Message ID | 20250225-pks-update-ref-optimization-v3-3-77c3687cda75@pks.im (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | refs: batch refname availability checks | expand |
Patrick Steinhardt <ps@pks.im> writes: > Most of the commands in git-update-ref(1) accept an old and/or new > object ID to update a specific reference to. These object IDs get parsed > via `repo_get_oid()`, which not only handles plain object IDs, but also > those that have a suffix like "~" or "^2". More surprisingly though, it > even knows to resolve references, despite the fact that its manpage does > not mention this fact even once. Are you referring to <new-oid> and other placeholders with "oid" in their names? I do think "oid" in our documentation implies that only full hexadecimal object names are allowed. The glossary agrees by saying that <object id> is a synonym for <object name> that is usually 40-hex SHA-1. However, that is not strictly enforced and we say <object> (or its typed variants like <commit-ish>) even when a command takes any extended SHA-1 expression, as described in Documentation/revisions.{txt,adoc}, not limited to full hexadecimal object name. So, I am somewhat sympathetic to your confusion, but not that much. When we wrote the command and documented it back in 2005, we did mean to take any object name that is spelled in any way, not just full hexadecimal. You may want to update the manual to emphasize that we encourage the use of full hexadecimal for this command and elsewhere where it is more appropriate. > One consequence of this is that we also check for ambiguous references: > when parsing a full object ID where the DWIM mechanism would also cause > us to resolve it as a branch, we'd end up printing a warning. While this > check makes sense to have in general, it is arguably less useful in the > context of git-update-ref(1). > > - The manpage is explicitly structured around object IDs. So if we see > a fully blown object ID, the intent should be quite clear in > general. > > - The command is part of our plumbing layer and not a tool that users > would generally use in interactive workflows. As such, the warning > will likely not be visible to anybody in the first place. In addition, if the user meant to refer to a ref, it is possible to disambiguate by prefixing refs/tags/ or whatever. So squelching the warning unconditionally might make sense. We will yield the value of the full hexadecimal object name, instead of the value of the ref that is confusingly named, so there is no material change in the behaviour here. OK.
On Wed, Feb 26, 2025 at 02:26:59PM -0800, Junio C Hamano wrote: > Patrick Steinhardt <ps@pks.im> writes: > > > Most of the commands in git-update-ref(1) accept an old and/or new > > object ID to update a specific reference to. These object IDs get parsed > > via `repo_get_oid()`, which not only handles plain object IDs, but also > > those that have a suffix like "~" or "^2". More surprisingly though, it > > even knows to resolve references, despite the fact that its manpage does > > not mention this fact even once. > > Are you referring to <new-oid> and other placeholders with "oid" in > their names? I do think "oid" in our documentation implies that > only full hexadecimal object names are allowed. The glossary agrees > by saying that <object id> is a synonym for <object name> that is > usually 40-hex SHA-1. However, that is not strictly enforced and we > say <object> (or its typed variants like <commit-ish>) even when a > command takes any extended SHA-1 expression, as described in > Documentation/revisions.{txt,adoc}, not limited to full hexadecimal > object name. > > So, I am somewhat sympathetic to your confusion, but not that much. > When we wrote the command and documented it back in 2005, we did > mean to take any object name that is spelled in any way, not just > full hexadecimal. You may want to update the manual to emphasize > that we encourage the use of full hexadecimal for this command and > elsewhere where it is more appropriate. Yeah. I have been aware of the behaviour beforehand, but an unsuspecting user that reads through the manpage wouldn't be able to figure out at all that this is the case. I guess this is something we should improve, but I think it's outside of the scope of this series. #leftoverbits Patrick
diff --git a/builtin/update-ref.c b/builtin/update-ref.c index 4d35bdc4b4b..1d541e13ade 100644 --- a/builtin/update-ref.c +++ b/builtin/update-ref.c @@ -179,7 +179,8 @@ static int parse_next_oid(const char **next, const char *end, (*next)++; *next = parse_arg(*next, &arg); if (arg.len) { - if (repo_get_oid(the_repository, arg.buf, oid)) + if (repo_get_oid_with_flags(the_repository, arg.buf, oid, + GET_OID_SKIP_AMBIGUITY_CHECK)) goto invalid; } else { /* Without -z, an empty value means all zeros: */ @@ -197,7 +198,8 @@ static int parse_next_oid(const char **next, const char *end, *next += arg.len; if (arg.len) { - if (repo_get_oid(the_repository, arg.buf, oid)) + if (repo_get_oid_with_flags(the_repository, arg.buf, oid, + GET_OID_SKIP_AMBIGUITY_CHECK)) goto invalid; } else if (flags & PARSE_SHA1_ALLOW_EMPTY) { /* With -z, treat an empty value as all zeros: */ @@ -299,7 +301,8 @@ static void parse_cmd_symref_update(struct ref_transaction *transaction, die("symref-update %s: expected old value", refname); if (!strcmp(old_arg, "oid")) { - if (repo_get_oid(the_repository, old_target, &old_oid)) + if (repo_get_oid_with_flags(the_repository, old_target, &old_oid, + GET_OID_SKIP_AMBIGUITY_CHECK)) die("symref-update %s: invalid oid: %s", refname, old_target); have_old_oid = 1; @@ -772,7 +775,8 @@ int cmd_update_ref(int argc, refname = argv[0]; value = argv[1]; oldval = argv[2]; - if (repo_get_oid(the_repository, value, &oid)) + if (repo_get_oid_with_flags(the_repository, value, &oid, + GET_OID_SKIP_AMBIGUITY_CHECK)) die("%s: not a valid SHA1", value); } @@ -783,7 +787,8 @@ int cmd_update_ref(int argc, * must not already exist: */ oidclr(&oldoid, the_repository->hash_algo); - else if (repo_get_oid(the_repository, oldval, &oldoid)) + else if (repo_get_oid_with_flags(the_repository, oldval, &oldoid, + GET_OID_SKIP_AMBIGUITY_CHECK)) die("%s: not a valid old SHA1", oldval); }
Most of the commands in git-update-ref(1) accept an old and/or new object ID to update a specific reference to. These object IDs get parsed via `repo_get_oid()`, which not only handles plain object IDs, but also those that have a suffix like "~" or "^2". More surprisingly though, it even knows to resolve references, despite the fact that its manpage does not mention this fact even once. One consequence of this is that we also check for ambiguous references: when parsing a full object ID where the DWIM mechanism would also cause us to resolve it as a branch, we'd end up printing a warning. While this check makes sense to have in general, it is arguably less useful in the context of git-update-ref(1). This is out of two reasons: - The manpage is explicitly structured around object IDs. So if we see a fully blown object ID, the intent should be quite clear in general. - The command is part of our plumbing layer and not a tool that users would generally use in interactive workflows. As such, the warning will likely not be visible to anybody in the first place. Furthermore, this check can be quite expensive when updating lots of references via `--stdin`, because we try to read multiple references per object ID that we parse according to the DWIM rules. This effect can be seen both with the "files" and "reftable" backend. The issue is not unique to git-update-ref(1), but was also an issue in git-cat-file(1), where it was addressed by disabling the ambiguity check in 25fba78d36b (cat-file: disable object/refname ambiguity check for batch mode, 2013-07-12). Disable the warning in git-update-ref(1), which provides a significant speedup with both backends. The following benchmark creates 10000 new references with a 100000 preexisting refs with the "files" backend: Benchmark 1: update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = HEAD~) Time (mean ± σ): 467.3 ms ± 5.1 ms [User: 100.0 ms, System: 365.1 ms] Range (min … max): 461.9 ms … 479.3 ms 10 runs Benchmark 2: update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = HEAD) Time (mean ± σ): 394.1 ms ± 5.8 ms [User: 63.3 ms, System: 327.6 ms] Range (min … max): 384.9 ms … 405.7 ms 10 runs Summary update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = HEAD) ran 1.19 ± 0.02 times faster than update-ref: create many refs (refformat = files, preexisting = 100000, new = 10000, revision = HEAD~) And with the "reftable" backend: Benchmark 1: update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = HEAD~) Time (mean ± σ): 146.9 ms ± 2.2 ms [User: 90.4 ms, System: 56.0 ms] Range (min … max): 142.7 ms … 150.8 ms 19 runs Benchmark 2: update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = HEAD) Time (mean ± σ): 63.2 ms ± 1.1 ms [User: 41.0 ms, System: 21.8 ms] Range (min … max): 61.1 ms … 66.6 ms 41 runs Summary update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = HEAD) ran 2.32 ± 0.05 times faster than update-ref: create many refs (refformat = reftable, preexisting = 100000, new = 10000, revision = HEAD~) Note that the absolute improvement with both backends is roughly in the same ballpark, but the relative improvement for the "reftable" backend is more significant because writing the new table to disk is faster in the first place. Signed-off-by: Patrick Steinhardt <ps@pks.im> --- builtin/update-ref.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-)