[RESEND] refs: Always pass old object name to reftx hook

Message-ID: <ae5c3b2b783f912a02b26142ecd753bf92530d2f.1610974040.git.ps@pks.im>
Date: Mon, 18 Jan 2021 13:49:05 +0100
From: Patrick Steinhardt <ps@pks.im>
To: git@vger.kernel.org
Cc: peff@peff.net, me@ttaylorr.com, gitster@pobox.com
Subject: [PATCH RESEND] refs: Always pass old object name to reftx hook
In-Reply-To: <d255c7a5f95635c2e7ae36b9689c3efd07b4df5d.1604501894.git.ps@pks.im>
References: <d255c7a5f95635c2e7ae36b9689c3efd07b4df5d.1604501894.git.ps@pks.im>

Patrick Steinhardt Jan. 18, 2021, 12:49 p.m. UTC

Inputs of the reference-transaction hook currently depend on the
command which is being run. For example if the command `git update-ref
$REF $A $B` is executed, it will receive "$B $A $REF" as input, but if
the command `git update-ref $REF $A` is executed without providing the
old value, then it will receive "0*40 $A $REF" as input. This is due to
the fact that we directly write queued transaction updates into the
hook's standard input, which will not contain the old object value in
case it wasn't provided.

While this behaviour reflects what is happening as part of the
repository, it doesn't feel like it is useful. The main intent of the
reference-transaction hook is to be able to completely audit all
reference updates, no matter where they come from. As such, it makes a
lot more sense to always provide actual values instead of what the user
wanted. Furthermore, it's impossible for the hook to distinguish whether
this is intended to be a branch creation or a branch update without
doing additional digging with the current format.

Fix the issue by storing the old object value into the queued
transaction update operation if it wasn't provided by the caller.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 Documentation/githooks.txt       |  6 ++++++
 refs/files-backend.c             |  8 ++++++++
 refs/packed-backend.c            |  2 ++
 t/t1416-ref-transaction-hooks.sh | 12 ++++++------
 4 files changed, 22 insertions(+), 6 deletions(-)
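
For readers following along, the input format described in the commit message can be illustrated with a small sketch. It is not part of the patch; the names `RefUpdate` and `parse_updates` are made up for illustration, and the object names are placeholders. It assumes the hook receives one "<old> <new> <ref>" line per queued update on standard input, as the message above describes.

```python
from typing import List, NamedTuple

ZERO_OID = "0" * 40  # all-zero object name (SHA-1 length)


class RefUpdate(NamedTuple):
    old: str
    new: str
    ref: str


def parse_updates(stdin_text: str) -> List[RefUpdate]:
    """Parse '<old> <new> <ref>' lines as fed to the hook on stdin."""
    updates = []
    for line in stdin_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        old, new, ref = line.split(" ", 2)
        updates.append(RefUpdate(old, new, ref))
    return updates


# Placeholder object names standing in for $A (new) and $B (old).
A, B = "a" * 40, "b" * 40

# `git update-ref $REF $A $B`: the old value is passed through.
with_old = parse_updates(f"{B} {A} refs/heads/main\n")[0]
# `git update-ref $REF $A`: the old slot is all zeros.
without_old = parse_updates(f"{ZERO_OID} {A} refs/heads/main\n")[0]

print(with_old.old == B, without_old.old == ZERO_OID)
```

With the current behaviour, the second line looks the same whether or not refs/heads/main already exists, which is the ambiguity the patch sets out to address.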

Junio C Hamano Jan. 18, 2021, 10:45 p.m. UTC | #1

Patrick Steinhardt <ps@pks.im> writes:

> Inputs of the reference-transaction hook currently depend on the
> command which is being run. For example if the command `git update-ref
> $REF $A $B` is executed, it will receive "$B $A $REF" as input, but if
> the command `git update-ref $REF $A` is executed without providing the
> old value, then it will receive "0*40 $A $REF" as input. This is due to
> the fact that we directly write queued transaction updates into the
> hook's standard input, which will not contain the old object value in
> case it wasn't provided.

In effect, the user says "I do not care if this update races with
somebody else and it is perfectly OK if it overwrites their update"
by not giving $B.

> While this behaviour reflects what is happening as part of the
> repository, it doesn't feel like it is useful. The main intent of the
> reference-transaction hook is to be able to completely audit all
> reference updates, no matter where they come from. As such, it makes a
> lot more sense to always provide actual values instead of what the user
> wanted. Furthermore, it's impossible for the hook to distinguish whether
> this is intended to be a branch creation or a branch update without
> doing additional digging with the current format.

But shouldn't the transaction hook script be allowed to learn the
end-user intention and behave differently?  If we replace the
missing old object before calling the script, wouldn't it lose
information?

The above is not an objection posed as two rhetorical questions.  I am
purely curious why losing information is OK in this case, or why it
may not be so OK but should still be acceptable because it is a lesser
evil than giving 0{40} to the hooks.

Even without this change, the current value the hook can learn by
looking the ref up itself if it really wanted to, no?

Patrick Steinhardt Jan. 20, 2021, 6:28 a.m. UTC | #2

On Mon, Jan 18, 2021 at 02:45:30PM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > Inputs of the reference-transaction hook currently depend on the
> > command which is being run. For example if the command `git update-ref
> > $REF $A $B` is executed, it will receive "$B $A $REF" as input, but if
> > the command `git update-ref $REF $A` is executed without providing the
> > old value, then it will receive "0*40 $A $REF" as input. This is due to
> > the fact that we directly write queued transaction updates into the
> > hook's standard input, which will not contain the old object value in
> > case it wasn't provided.
> 
> In effect, the user says "I do not care if this update races with
> somebody else and it is perfectly OK if it overwrites their update"
> by not giving $B.
> 
> > While this behaviour reflects what is happening as part of the
> > repository, it doesn't feel like it is useful. The main intent of the
> > reference-transaction hook is to be able to completely audit all
> > reference updates, no matter where they come from. As such, it makes a
> > lot more sense to always provide actual values instead of what the user
> > wanted. Furthermore, it's impossible for the hook to distinguish whether
> > this is intended to be a branch creation or a branch update without
> > doing additional digging with the current format.
> 
> But shouldn't the transaction hook script be allowed to learn the
> end-user intention and behave differently?  If we replace the
> missing old object before calling the script, wouldn't it lose
> information?
> 
> The above is not an objection posed as two rhetorical questions.  I am
> purely curious why losing information is OK in this case, or why it
> may not be so OK but should still be acceptable because it is a lesser
> evil than giving 0{40} to the hooks.
> 
> Even without this change, the current value the hook can learn by
> looking the ref up itself if it really wanted to, no?

I think the biggest problem is that right now, you cannot discern the
actual intention of the user because the information provided to the
hook is ambiguous in the branch creation case: "$ZERO_OID $NEW_OID $REF"
could mean the user intends to create a new branch where it shouldn't
have existed previously. BUT it could also mean that the user just
doesn't care what the reference previously pointed to.

The user could now try to derive the intention by manually looking up
the current state of the reference. But that does feel kind of awkward
to me.

To me, having clearly defined semantics ("The script always gets old and
new value of the branch regardless of what the user did") is preferable
to having ambiguous semantics.

Patrick

Junio C Hamano Jan. 20, 2021, 7:06 a.m. UTC | #3

Patrick Steinhardt <ps@pks.im> writes:

>> Even without this change, the current value the hook can learn by
>> looking the ref up itself if it really wanted to, no?
>
> I think the biggest problem is that right now, you cannot discern the
> actual intention of the user because the information provided to the
> hook is ambiguous in the branch creation case: "$ZERO_OID $NEW_OID $REF"
> could mean the user intends to create a new branch where it shouldn't
> have existed previously. BUT it could also mean that the user just
> doesn't care what the reference previously pointed to.

Yes, it can mean both, but when you pretend to be that hook,
wouldn't you check if the ref exists?  If not, the user is trying to
create it, and otherwise, the user does not know or care what the
original value is, no?

> The user could now try to derive the intention by manually looking up
> the current state of the reference. But that does feel kind of awkward
> to me.

So in short, with respect to the OLD slot, there are three kinds of
end-user intention that could be conveyed to the hook:

 (1) the user does not care, so 0{40} appears in the OLD slot here,
 (2) the user is creating, so 0{40} appears in the OLD slot here, and
 (3) the user does care, and this is the OID in the OLD slot,

And (1) and (2) cannot be separated without looking at the ref (in
other words, if the hook really cares, it can find it out).

But if you replace 0{40} with the current OID, then you are making
it impossible to tell (1) and (3) apart.  The hook cannot tell the
distinction even if it is willing to go the extra mile.

So that sounds like a strict disimprovement to me.

If you can invent a way to help the hook to tell all three apart, I
am very much interested.  It would earn you a bonus point if you can
do so without breaking backward compatibility (but I doubt that it
is possible).

> To me, having clearly defined semantics ("The script always gets old and
> new value of the branch regardless of what the user did") is preferable
> to having ambiguous semantics.

But "The script always gets old that was given by the user and the
new value to be stored" is very clearly defined semantics already.

On the other hand, "The script gets a non-NULL object name in both
cases, either when the user says s/he does not care, or when the
user insists that the old value must be that, and it is not just
ambiguous but is impossible to tell apart" is worse than just being
ambiguous.
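
The three-way distinction argued in this message can be sketched as follows. This is an illustration only: `classify` and `lookup` are hypothetical names, and `lookup` stands in for whatever the hook would use to read the ref's current value (here a plain dictionary). It follows the reading given above, where an all-zero OLD slot on a missing ref is taken to mean creation.

```python
from typing import Callable, Optional

ZERO_OID = "0" * 40


def classify(old: str, ref: str,
             lookup: Callable[[str], Optional[str]]) -> str:
    """Guess the end-user intention from the OLD slot of one hook line."""
    if old != ZERO_OID:
        return "expects-old"    # case (3): user insisted on an old value
    if lookup(ref) is None:
        return "create"         # case (2): ref does not exist yet
    return "does-not-care"      # case (1): ref exists, old value not given


# Hypothetical repository state: one existing branch.
refs = {"refs/heads/main": "b" * 40}
lookup = refs.get

# Current behaviour: all three cases can be told apart with a lookup.
print(classify(ZERO_OID, "refs/heads/main", lookup))
print(classify(ZERO_OID, "refs/heads/topic", lookup))
print(classify("b" * 40, "refs/heads/main", lookup))

# If 0{40} were replaced by the current value before the hook runs, a
# "does not care" update would arrive looking exactly like case (3).
filled_in = refs["refs/heads/main"]
print(classify(filled_in, "refs/heads/main", lookup))
```

The last call returns the same answer as case (3), and no lookup on the hook's side can undo that, which is the loss of information described above.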

Patrick Steinhardt Jan. 22, 2021, 6:44 a.m. UTC | #4

On Tue, Jan 19, 2021 at 11:06:15PM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> >> Even without this change, the current value the hook can learn by
> >> looking the ref up itself if it really wanted to, no?
> >
> > I think the biggest problem is that right now, you cannot discern the
> > actual intention of the user because the information provided to the
> > hook is ambiguous in the branch creation case: "$ZERO_OID $NEW_OID $REF"
> > could mean the user intends to create a new branch where it shouldn't
> > have existed previously. BUT it could also mean that the user just
> > doesn't care what the reference previously pointed to.
> 
> Yes, it can mean both, but when you pretend to be that hook,
> wouldn't you check if the ref exists?  If not, the user is trying to
> create it, and otherwise, the user does not know or care what the
> original value is, no?

As long as you're aware as the script author, yes.

There is one gotcha though: you can verify the state when the
reference-transaction hook gets invoked in the "prepared" state, as it
means that all references have been locked and thus cannot be changed by
any other well-behaved process accessing the git repository. In
"committed" or "aborted" that's not true anymore, given that the state
has changed already, so any locks have been released and it's impossible
to find out what happened now.

> > The user could now try to derive the intention by manually looking up
> > the current state of the reference. But that does feel kind of awkward
> > to me.
> 
> So in short, with respect to the OLD slot, there are three kinds of
> end-user intention that could be conveyed to the hook:
> 
>  (1) the user does not care, so 0{40} appears in the OLD slot here,
>  (2) the user is creating, so 0{40} appears in the OLD slot here, and
>  (3) the user does care, and this is the OID in the OLD slot,
> 
> And (1) and (2) cannot be separated without looking at the ref (in
> other words, if the hook really cares, it can find it out).
> 
> But if you replace 0{40} with the current OID, then you are making
> it impossible to tell (1) and (3) apart.  The hook cannot tell the
> distinction even if it is willing to go the extra mile.
> 
> So that sounds like a strict disimprovement to me.

True.

> If you can invent a way to help the hook to tell all three apart, I
> am very much interested.  It would earn you a bonus point if you can
> do so without breaking backward compatibility (but I doubt that it
> is possible).

I did think about ways to do this, but wasn't yet able to find one.
And doing it in a backwards-compatible way is probably going to be
impossible. One idea I had is to use something similar to the
peeled-format we use in packed refs in case the actual change is
different from the user-provided change. E.g.

    0{40} <new> <ref>
    ^<old>

or

    0{40}^<old> <new> <ref>

That can be considered as backwards-incompatible though.

> > To me, having clearly defined semantics ("The script always gets old and
> > new value of the branch regardless of what the user did") is preferable
> > to having ambiguous semantics.
> 
> But "The script always gets old that was given by the user and the
> new value to be stored" is very clearly defined semantics already.
> 
> On the other hand, "The script gets a non-NULL object name in both
> cases, either when the user says s/he does not care, or when the
> user insists that the old value must be that, and it is not just
> ambiguous but is impossible to tell apart" is worse than just being
> ambiguous.

Yup. Whatever we agree on, what is clear is that the documentation needs
to be more specific here.

Patrick
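
The "prepared" gotcha mentioned in this message can be sketched as below. This is a rough illustration under assumptions: the function name `audit` and the `lookup` callable are hypothetical, and the state string is taken as a plain parameter rather than read from the hook's command line. The only point shown is that a lookup of the ref is trusted in the "prepared" state alone.

```python
from typing import Callable, Iterable, List, Optional, Tuple

ZERO_OID = "0" * 40


def audit(state: str, lines: Iterable[str],
          lookup: Callable[[str], Optional[str]]) -> List[Tuple[str, str]]:
    """Return (ref, note) pairs for the '<old> <new> <ref>' lines."""
    notes = []
    for line in lines:
        old, new, ref = line.split(" ", 2)
        if old != ZERO_OID:
            note = "old value supplied"
        elif state != "prepared":
            # Locks are released; the ref may already have changed.
            note = "old value unknown"
        elif lookup(ref) is None:
            note = "creation"
        else:
            note = "overwrite, old value not checked"
        notes.append((ref, note))
    return notes


# Hypothetical state: refs/heads/main exists, refs/heads/topic does not.
refs = {"refs/heads/main": "b" * 40}
lines = [
    f"{ZERO_OID} {'a' * 40} refs/heads/main",
    f"{ZERO_OID} {'a' * 40} refs/heads/topic",
]

print(audit("prepared", lines, refs.get))
print(audit("committed", lines, refs.get))
```

In the "committed" and "aborted" states the sketch deliberately refuses to guess, since by then a lookup would no longer reflect the state the transaction started from.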

Junio C Hamano Jan. 22, 2021, 6:33 p.m. UTC | #5

Patrick Steinhardt <ps@pks.im> writes:

>> Yes, it can mean both, but when you pretend to be that hook,
>> wouldn't you check if the ref exists?  If not, the user is trying to
>> create it, and otherwise, the user does not know or care what the
>> original value is, no?
>
> As long as you're aware as the script author, yes.

As you said below, I agree that clear documentation may be
necessary.

> There is one gotcha though: you can verify the state when the
> reference-transaction hook gets invoked in the "prepared" state, as it
> means that all references have been locked and thus cannot be changed by
> any other well-behaved process accessing the git repository. In
> "committed" or "aborted" that's not true anymore, given that the state
> has changed already, so any locks have been released and it's impossible
> to find out what happened now.

True, but isn't the situation the same if we replaced the 0{40} old
side with (one version of) original value of the ref?

> different from the user-provided change. E.g.
>
>     0{40} <new> <ref>
>     ^<old>
>
> or
>
>     0{40}^<old> <new> <ref>
>
> That can be considered as backwards-incompatible though.

Yes, it is an incompatible change.  I thought of somehow annotating
the old side, e.g. "<old> <new> <ref>" vs "<OLD> <new> <ref>", to
show the distinction between "this is the original value of ref the
user wanted to see when updating <ref>" and "the user does not care
what value the <ref> gets updated from, but by the way, here is the
original value of the ref as Git sees it" [*], but I cannot think of
a way to do so without breaking existing readers.

    Side note: here, I am exploring the approach to replace 0{40}
    that is given when "do not care" into an actual original object
    name taken from the current state, like your patch did, but
    trying to find a way to make non-NULL object name distinguishable
    between the two cases (i.e. user-supplied vs system-filled).

That raises another question.  How much trust should the hook place
on the value of the <old> given to it?  When a non-NULL <old> value
is given by the end-user, does the hook get the value as-is, or do
we read the current value of the ref and send that as <old>?  Does
the transaction get rejected if the two are different and such a
record is not even given to the hook?

> Yup. Whatever we agree on, what is clear is that the documentation needs
> to be more specific here.

Yes, agreed.

Thanks.
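
Why the two formats floated in this thread would break existing readers can be shown with a small sketch. Both formats are only proposals from the discussion, not anything Git emits, and `naive_reader` is a hypothetical stand-in for a hook written against the current three-field format.

```python
from typing import List, Tuple

ZERO_OID = "0" * 40
A, B = "a" * 40, "b" * 40  # placeholder new and old object names


def naive_reader(text: str) -> List[Tuple[str, str, str]]:
    """A reader that assumes exactly '<old> <new> <ref>' on every line."""
    out = []
    for line in text.splitlines():
        old, new, ref = line.split(" ")
        out.append((old, new, ref))
    return out


# Proposal 1: an extra "^<old>" line after the update line.
proposal_1 = f"{ZERO_OID} {A} refs/heads/main\n^{B}\n"
try:
    naive_reader(proposal_1)
    broke_on_extra_line = False
except ValueError:
    # The "^<old>" line has one field, so unpacking three fails.
    broke_on_extra_line = True

# Proposal 2: "0{40}^<old>" in the OLD slot.
proposal_2 = f"{ZERO_OID}^{B} {A} refs/heads/main\n"
old, new, ref = naive_reader(proposal_2)[0]
# The line still has three fields, but the OLD field is no longer a
# plain object name, so a comparison against 0{40} silently fails.
misread_as_user_supplied = old != ZERO_OID

print(broke_on_extra_line, misread_as_user_supplied)
```

The first proposal fails loudly in such a reader, while the second parses but is misinterpreted, which is arguably the worse of the two failure modes.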
