Message ID | 20241230-b4-pks-object-file-racy-collision-check-v1-1-11571294e60a@pks.im (mailing list archive)
---|---
State | Accepted
Commit | 0ad3d656521aa16a6496aa855bbde97160a2b2bc
Series | object-file: fix race in object collision check
Patrick Steinhardt <ps@pks.im> writes:

> By definition, two files cannot collide with each other when one of them
> has been removed. We can thus trivially fix the issue by ignoring ENOENT
> when opening either of the files we're about to check for collision.

Thanks for digging it down to the cause.

It is more like even if these two files collided (i.e. have the same
name based on what the hash function says, with different contents),
when one of them has been removed, we have no way to check if the
collision is benign, and even if it were not, we cannot do anything
about it, isn't it?

I do like the simplicity of the solution. I wonder given bad enough
race, we could fall into a case where both files are missing?

Thanks, will queue.

> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> Hi,
>
> this patch is the follow-up for [1], where I've mentioned a couple of CI
> flakes that happen rather regularly. As it turns out, this race was a
> real bug hiding in the newly introduced object collision check in case
> one of the files got unlinked while performing the check.
>
> Thanks!
>
> Patrick
>
> [1]: <Z2-2dbYVuuLxpNmK@pks.im>
> ---
>  object-file.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/object-file.c b/object-file.c
> index 5b792b3dd42cecde43a1b18abc164fd368cbcd69..f84dcd2f2a7b88716ab47bc00ee7a605a82e8d21 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1978,13 +1978,15 @@ static int check_collision(const char *filename_a, const char *filename_b)
>
>          fd_a = open(filename_a, O_RDONLY);
>          if (fd_a < 0) {
> -                ret = error_errno(_("unable to open %s"), filename_a);
> +                if (errno != ENOENT)
> +                        ret = error_errno(_("unable to open %s"), filename_a);
>                  goto out;
>          }
>
>          fd_b = open(filename_b, O_RDONLY);
>          if (fd_b < 0) {
> -                ret = error_errno(_("unable to open %s"), filename_b);
> +                if (errno != ENOENT)
> +                        ret = error_errno(_("unable to open %s"), filename_b);
>                  goto out;
>          }
>
> ---
> base-commit: 306ab352f4e98f6809ce52fc4e5d63fb947d0635
> change-id: 20241230-b4-pks-object-file-racy-collision-check-62a8d1588116
On Mon, Dec 30, 2024 at 06:40:53AM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
> > By definition, two files cannot collide with each other when one of them
> > has been removed. We can thus trivially fix the issue by ignoring ENOENT
> > when opening either of the files we're about to check for collision.
>
> Thanks for digging it down to the cause.
>
> It is more like even if these two files collided (i.e. have the same
> name based on what the hash function says, with different contents),
> when one of them has been removed, we have no way to check if the
> collision is benign, and even if it were not, we cannot do anything
> about it, isn't it?

Depends on what "benign" means in this context, I guess. We can only
assert the most trivial case of it being "benign", namely that we have
computed a packfile that actually is the exact same. This is also going
to be the most common case, as everything else would depend on a
cryptographic collision of the packfile contents. And in that case... we
cannot do anything about it, yes.

> I do like the simplicity of the solution. I wonder given bad enough
> race, we could fall into a case where both files are missing?

I was wondering about that, too, but it would very much feel like a bug
to me if that were ever to happen. So I briefly considered whether I
should treat the passed-in filenames differently:

  - One that must exist non-racily. This is our temporary object or
    packfile that we want to move into place.

  - And one that may have been removed racily. This is our target file
    path that we want to overwrite, unless there is a collision.

The idea would be to only handle ENOENT for the second case. But in the
end I don't think it's worth the complexity because `check_collision()`
is used before rename(3p)ing the former into place, and that function
would already notice ENOENT anyway. So we would eventually just die the
same.

Patrick
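To make that asymmetry concrete, here is a minimal standalone sketch of the treatment described above. It is illustrative only and not git.git code: the function name and error reporting are made up for this example, and only the filename_a/filename_b roles come from the patch.

```c
/*
 * Sketch (not git.git code): "filename_a" is the tempfile we wrote
 * ourselves and must exist; "filename_b" is the preexisting target
 * that may legitimately vanish under us.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int check_collision_sketch(const char *filename_a, const char *filename_b)
{
	int fd_a, fd_b, ret = 0;

	fd_a = open(filename_a, O_RDONLY);
	if (fd_a < 0) {
		/* Our own tempfile vanishing is unexpected: always an error. */
		fprintf(stderr, "unable to open %s: %s\n",
			filename_a, strerror(errno));
		return -1;
	}

	fd_b = open(filename_b, O_RDONLY);
	if (fd_b < 0) {
		if (errno != ENOENT) {
			fprintf(stderr, "unable to open %s: %s\n",
				filename_b, strerror(errno));
			ret = -1;
		}
		/* ENOENT: the target was removed, so there is nothing to collide with. */
		close(fd_a);
		return ret;
	}

	/* ... a byte-for-byte comparison of fd_a and fd_b would follow here ... */

	close(fd_a);
	close(fd_b);
	return ret;
}
```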
Patrick Steinhardt <ps@pks.im> writes:

> On Mon, Dec 30, 2024 at 06:40:53AM -0800, Junio C Hamano wrote:
>> Patrick Steinhardt <ps@pks.im> writes:
>>
>> > By definition, two files cannot collide with each other when one of them
>> > has been removed. We can thus trivially fix the issue by ignoring ENOENT
>> > when opening either of the files we're about to check for collision.
>>
>> Thanks for digging it down to the cause.
>>
>> It is more like even if these two files collided (i.e. have the same
>> name based on what the hash function says, with different contents),
>> when one of them has been removed, we have no way to check if the
>> collision is benign, and even if it were not, we cannot do anything
>> about it, isn't it?
>
> Depends on what "benign" means in this context, I guess. We can only
> assert the most trivial case of it being "benign", namely that we have
> computed a packfile that actually is the exact same. This is also going
> to be the most common case, as everything else would depend on a
> cryptographic collision of the packfile contents. And in that case... we
> cannot do anything about it, yes.

Yes, the whole point of this collision check is to notice the rare
case where we are under attack with cryptographic collision, so "most
of the time it is OK so not being able to check is fine" is not an
impression we want to give readers.

> The idea would be to only handle ENOENT for the second case. But in the
> end I don't think it's worth the complexity because `check_collision()`
> is used before rename(3p)ing the former into place, and that function
> would already notice ENOENT anyway. So we would eventually just die the
> same.

Thanks for thinking it through. Queued.
On Mon, Dec 30, 2024 at 03:50:04PM +0100, Patrick Steinhardt wrote:

> On Mon, Dec 30, 2024 at 06:40:53AM -0800, Junio C Hamano wrote:
> > Patrick Steinhardt <ps@pks.im> writes:
> >
> > > By definition, two files cannot collide with each other when one of them
> > > has been removed. We can thus trivially fix the issue by ignoring ENOENT
> > > when opening either of the files we're about to check for collision.
> >
> > Thanks for digging it down to the cause.
> >
> > It is more like even if these two files collided (i.e. have the same
> > name based on what the hash function says, with different contents),
> > when one of them has been removed, we have no way to check if the
> > collision is benign, and even if it were not, we cannot do anything
> > about it, isn't it?
>
> Depends on what "benign" means in this context, I guess. We can only
> assert the most trivial case of it being "benign", namely that we have
> computed a packfile that actually is the exact same. This is also going
> to be the most common case, as everything else would depend on a
> cryptographic collision of the packfile contents. And in that case... we
> cannot do anything about it, yes.

There is one gotcha here, though. We call this collision check only if
we got EEXIST trying to move the tempfile into place. If the destination
file then goes away, we can't do the collision check. But is it right to
quietly return success?

If the contents of the two were the same, that's fine. We don't need the
extra copy.

But if the contents were not the same, we'd prefer either to actually
copy the contents into place, or to return an error.

Of course we can't know, because the destination file has gone away. In
the common case they will be the same, but the whole point of this check
is to allow loosening the cryptographic collision of the packfile
contents. So the safest thing would be to retain the tempfile, copying
it into the destination file. That errs on the side of keeping data when
we cannot make a determination.

IOW, if we see ENOENT on filename_b, should we then loop back in the
caller to try the link() again?

> > I do like the simplicity of the solution. I wonder given bad enough
> > race, we could fall into a case where both files are missing?
>
> I was wondering about that, too, but it would very much feel like a bug
> to me if that were ever to happen. So I briefly considered whether I
> should treat the passed-in filenames differently:
>
>   - One that must exist non-racily. This is our temporary object or
>     packfile that we want to move into place.
>
>   - And one that may have been removed racily. This is our target file
>     path that we want to overwrite, unless there is a collision.
>
> The idea would be to only handle ENOENT for the second case. But in the
> end I don't think it's worth the complexity because `check_collision()`
> is used before rename(3p)ing the former into place, and that function
> would already notice ENOENT anyway. So we would eventually just die the
> same.

I think check_collision() is used _after_ the attempt to rename() into
place. So there's a race when the tempfile goes away, but I think the
outcome is made a bit worse by your patch.

Consider a sequence like this:

  a. Process A writes tmp_pack_foo.

  b. Process A tries to link tmp_pack_foo to pack-<hash> but finds it
     already exists.

  c. Process A opens both tmp_pack_foo and pack-<hash>.

  d. Process A compares the two byte-for-byte, and then returns
     success/failure based on whether they were actually identical.

Now imagine there is a process B that deletes the file (maybe an
over-zealous "gc --prune=now" deletes the in-use temporary file):

  - if process B deletes it between steps (a) and (b), process A returns
    an error (there is nothing to link). The caller knows that the data
    was not stored.

  - if process B deletes it between (b) and (c), then before your patch
    we see an error (because we can't compare the files). After your
    patch, we continue on and return success. The caller knows the data
    was stored (via the original file, not our new copy).

  - if process B deletes it between (c) and (d), then process A has no
    idea. But at this point it does not matter. If the files were
    identical, we return success (and in fact, process A deletes the file
    itself). And if not identical, then we return error, and the caller
    knows the data was not stored.

So even though the exact behavior may depend on where we hit the race, I
think ignoring an ENOENT open() error on the tempfile meaningfully
changes what happens in the middle case.

In practice I don't really expect this to happen, and "gc --prune=now"
is inherently risky in a live repository. But I think we're probably
better off to continue treating it as an error if we can't open our own
tempfile.

-Peff
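For illustration, here is a sketch of the link-then-compare flow from steps (a) through (d), extended with the "loop back and retry the link()" idea for a destination that vanishes mid-check. Everything in it is hypothetical: the helper names are invented and the comparison is simplified, so it is only the shape of the suggestion, not the actual finalize_object_file() code.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Compare two open files byte-for-byte; 0 = identical, 1 = different,
 * -1 = error. Simplified: assumes read() fills the buffer except at EOF.
 */
static int bytes_differ(int fd_a, int fd_b)
{
	char buf_a[4096], buf_b[4096];
	for (;;) {
		ssize_t n_a = read(fd_a, buf_a, sizeof(buf_a));
		ssize_t n_b = read(fd_b, buf_b, sizeof(buf_b));
		if (n_a < 0 || n_b < 0)
			return -1;
		if (n_a != n_b || memcmp(buf_a, buf_b, (size_t)n_a))
			return 1;
		if (!n_a)
			return 0; /* both files hit EOF at the same time */
	}
}

static int move_into_place_sketch(const char *tmpfile, const char *destination)
{
	for (;;) {
		int fd_a, fd_b, differ;

		if (!link(tmpfile, destination))
			return 0;          /* step (b) succeeded: data is stored */
		if (errno != EEXIST)
			return -1;         /* e.g. our own tempfile is gone: hard error */

		/* steps (c)+(d): something is already there, compare contents */
		fd_a = open(tmpfile, O_RDONLY);
		if (fd_a < 0)
			return -1;         /* never ignore a missing tempfile */

		fd_b = open(destination, O_RDONLY);
		if (fd_b < 0) {
			close(fd_a);
			if (errno == ENOENT)
				continue;  /* destination vanished racily: retry link() */
			return -1;
		}

		differ = bytes_differ(fd_a, fd_b);
		close(fd_a);
		close(fd_b);
		return differ ? -1 : 0;    /* identical is fine; anything else is a collision */
	}
}
```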
Jeff King <peff@peff.net> writes:

> There is one gotcha here, though. We call this collision check only if
> we got EEXIST trying to move the tempfile into place. If the destination
> file then goes away, we can't do the collision check. But is it right to
> quietly return success?
>
> If the contents of the two were the same, that's fine. We don't need the
> extra copy.
>
> But if the contents were not the same, we'd prefer either to actually
> copy the contents into place, or to return an error.
>
> Of course we can't know, because the destination file has gone away. In
> the common case they will be the same, but the whole point of this check
> is to allow loosening the cryptographic collision of the packfile
> contents. So the safest thing would be to retain the tempfile, copying
> it into the destination file. That errs on the side of keeping data when
> we cannot make a determination.
>
> IOW, if we see ENOENT on filename_b, should we then loop back in the
> caller to try the link() again?

Yuck, I think you're absolutely right.

> I think check_collision() is used _after_ the attempt to rename() into
> place. So there's a race when the tempfile goes away, but I think the
> outcome is made a bit worse by your patch.
>
> Consider a sequence like this:
>
>   a. Process A writes tmp_pack_foo.
>
>   b. Process A tries to link tmp_pack_foo to pack-<hash> but finds it
>      already exists.
>
>   c. Process A opens both tmp_pack_foo and pack-<hash>.
>
>   d. Process A compares the two byte-for-byte, and then returns
>      success/failure based on whether they were actually identical.
>
> Now imagine there is a process B that deletes the file (maybe an
> over-zealous "gc --prune=now" deletes the in-use temporary file):
>
>   - if process B deletes it between steps (a) and (b), process A returns
>     an error (there is nothing to link). The caller knows that the data
>     was not stored.
>
>   - if process B deletes it between (b) and (c), then before your patch
>     we see an error (because we can't compare the files). After your
>     patch, we continue on and return success. The caller knows the data
>     was stored (via the original file, not our new copy).
>
>   - if process B deletes it between (c) and (d), then process A has no
>     idea. But at this point it does not matter. If the files were
>     identical, we return success (and in fact, process A deletes the file
>     itself). And if not identical, then we return error, and the caller
>     knows the data was not stored.
>
> So even though the exact behavior may depend on where we hit the race, I
> think ignoring an ENOENT open() error on the tempfile meaningfully
> changes what happens in the middle case.
>
> In practice I don't really expect this to happen, and "gc --prune=now"
> is inherently risky in a live repository. But I think we're probably
> better off to continue treating it as an error if we can't open our own
> tempfile.

So we'd ignore the racy and flaky tests, as hiding the flake by
ignoring the error would only hurt the real world users.
On Wed, Jan 01, 2025 at 08:50:51AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
>
> > There is one gotcha here, though. We call this collision check only if
> > we got EEXIST trying to move the tempfile into place. If the destination
> > file then goes away, we can't do the collision check. But is it right to
> > quietly return success?
> >
> > If the contents of the two were the same, that's fine. We don't need the
> > extra copy.
> >
> > But if the contents were not the same, we'd prefer either to actually
> > copy the contents into place, or to return an error.
> >
> > Of course we can't know, because the destination file has gone away. In
> > the common case they will be the same, but the whole point of this check
> > is to allow loosening the cryptographic collision of the packfile
> > contents. So the safest thing would be to retain the tempfile, copying
> > it into the destination file. That errs on the side of keeping data when
> > we cannot make a determination.
> >
> > IOW, if we see ENOENT on filename_b, should we then loop back in the
> > caller to try the link() again?
>
> Yuck, I think you're absolutely right.

So I think this part, if adjusted as I suggested, would fix the race in
the tests without making anything worse (it's just more code).

And then this...

> > I think check_collision() is used _after_ the attempt to rename() into
> > place. So there's a race when the tempfile goes away, but I think the
> > outcome is made a bit worse by your patch.
> >
> > Consider a sequence like this:
> >
> >   a. Process A writes tmp_pack_foo.
> >
> >   b. Process A tries to link tmp_pack_foo to pack-<hash> but finds it
> >      already exists.
> >
> >   c. Process A opens both tmp_pack_foo and pack-<hash>.
> >
> >   d. Process A compares the two byte-for-byte, and then returns
> >      success/failure based on whether they were actually identical.
> >
> > Now imagine there is a process B that deletes the file (maybe an
> > over-zealous "gc --prune=now" deletes the in-use temporary file):
> >
> >   - if process B deletes it between steps (a) and (b), process A returns
> >     an error (there is nothing to link). The caller knows that the data
> >     was not stored.
> >
> >   - if process B deletes it between (b) and (c), then before your patch
> >     we see an error (because we can't compare the files). After your
> >     patch, we continue on and return success. The caller knows the data
> >     was stored (via the original file, not our new copy).
> >
> >   - if process B deletes it between (c) and (d), then process A has no
> >     idea. But at this point it does not matter. If the files were
> >     identical, we return success (and in fact, process A deletes the file
> >     itself). And if not identical, then we return error, and the caller
> >     knows the data was not stored.
> >
> > So even though the exact behavior may depend on where we hit the race, I
> > think ignoring an ENOENT open() error on the tempfile meaningfully
> > changes what happens in the middle case.
> >
> > In practice I don't really expect this to happen, and "gc --prune=now"
> > is inherently risky in a live repository. But I think we're probably
> > better off to continue treating it as an error if we can't open our own
> > tempfile.
>
> So we'd ignore the racy and flaky tests, as hiding the flake by
> ignoring the error would only hurt the real world users.

...is all about ignoring ENOENT on the tmpfile itself. And I think we
can just drop that hunk entirely. The tests do not care here (they are
running simultaneous gc, but _not_ a simultaneous "--prune=now" gc).

-Peff
On Wed, Jan 01, 2025 at 01:19:52PM -0500, Jeff King wrote:

> On Wed, Jan 01, 2025 at 08:50:51AM -0800, Junio C Hamano wrote:
>
> > Jeff King <peff@peff.net> writes:
> >
> > > There is one gotcha here, though. We call this collision check only if
> > > we got EEXIST trying to move the tempfile into place. If the destination
> > > file then goes away, we can't do the collision check. But is it right to
> > > quietly return success?
> > >
> > > If the contents of the two were the same, that's fine. We don't need the
> > > extra copy.
> > >
> > > But if the contents were not the same, we'd prefer either to actually
> > > copy the contents into place, or to return an error.
> > >
> > > Of course we can't know, because the destination file has gone away. In
> > > the common case they will be the same, but the whole point of this check
> > > is to allow loosening the cryptographic collision of the packfile
> > > contents. So the safest thing would be to retain the tempfile, copying
> > > it into the destination file. That errs on the side of keeping data when
> > > we cannot make a determination.
> > >
> > > IOW, if we see ENOENT on filename_b, should we then loop back in the
> > > caller to try the link() again?
> >
> > Yuck, I think you're absolutely right.
>
> So I think this part, if adjusted as I suggested, would fix the race in
> the tests without making anything worse (it's just more code).

Hm, okay, I think that makes sense. Instead of being optimistic like my
first version was we should be pessimistic and assume the worst. Which I
guess is a sensible stance to take in a collision check.

> And then this...
>
> > > I think check_collision() is used _after_ the attempt to rename() into
> > > place. So there's a race when the tempfile goes away, but I think the
> > > outcome is made a bit worse by your patch.
> > >
> > > Consider a sequence like this:
> > >
> > >   a. Process A writes tmp_pack_foo.
> > >
> > >   b. Process A tries to link tmp_pack_foo to pack-<hash> but finds it
> > >      already exists.
> > >
> > >   c. Process A opens both tmp_pack_foo and pack-<hash>.
> > >
> > >   d. Process A compares the two byte-for-byte, and then returns
> > >      success/failure based on whether they were actually identical.
> > >
> > > Now imagine there is a process B that deletes the file (maybe an
> > > over-zealous "gc --prune=now" deletes the in-use temporary file):
> > >
> > >   - if process B deletes it between steps (a) and (b), process A returns
> > >     an error (there is nothing to link). The caller knows that the data
> > >     was not stored.
> > >
> > >   - if process B deletes it between (b) and (c), then before your patch
> > >     we see an error (because we can't compare the files). After your
> > >     patch, we continue on and return success. The caller knows the data
> > >     was stored (via the original file, not our new copy).
> > >
> > >   - if process B deletes it between (c) and (d), then process A has no
> > >     idea. But at this point it does not matter. If the files were
> > >     identical, we return success (and in fact, process A deletes the file
> > >     itself). And if not identical, then we return error, and the caller
> > >     knows the data was not stored.
> > >
> > > So even though the exact behavior may depend on where we hit the race, I
> > > think ignoring an ENOENT open() error on the tempfile meaningfully
> > > changes what happens in the middle case.
> > >
> > > In practice I don't really expect this to happen, and "gc --prune=now"
> > > is inherently risky in a live repository. But I think we're probably
> > > better off to continue treating it as an error if we can't open our own
> > > tempfile.
> >
> > So we'd ignore the racy and flaky tests, as hiding the flake by
> > ignoring the error would only hurt the real world users.
>
> ...is all about ignoring ENOENT on the tmpfile itself. And I think we
> can just drop that hunk entirely. The tests do not care here (they are
> running simultaneous gc, but _not_ a simultaneous "--prune=now" gc).

Yeah, we don't need this hunk then. I'll send a follow-up change.

Patrick
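The follow-up itself is not part of this thread. Purely as a sketch of the direction agreed on above, with invented names: the tempfile side stays a hard error, while a vanished destination is reported distinctly so that the caller can loop back to its link() attempt.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

enum collision_check_result {
	COLLISION_CHECK_DONE = 0,   /* both files open: proceed with the comparison */
	COLLISION_CHECK_GONE = 1,   /* destination vanished: caller retries link()  */
	COLLISION_CHECK_ERROR = -1, /* anything else, including a missing tempfile  */
};

static enum collision_check_result
open_for_collision_check(const char *tempfile, const char *destination,
			 int *fd_a, int *fd_b)
{
	*fd_a = open(tempfile, O_RDONLY);
	if (*fd_a < 0) {
		/* The tempfile is ours; it going missing is always an error. */
		fprintf(stderr, "unable to open %s: %s\n", tempfile, strerror(errno));
		return COLLISION_CHECK_ERROR;
	}

	*fd_b = open(destination, O_RDONLY);
	if (*fd_b < 0) {
		close(*fd_a);
		if (errno == ENOENT)
			return COLLISION_CHECK_GONE; /* let the caller link() again */
		fprintf(stderr, "unable to open %s: %s\n", destination, strerror(errno));
		return COLLISION_CHECK_ERROR;
	}

	return COLLISION_CHECK_DONE;
}
```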
diff --git a/object-file.c b/object-file.c
index 5b792b3dd42cecde43a1b18abc164fd368cbcd69..f84dcd2f2a7b88716ab47bc00ee7a605a82e8d21 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1978,13 +1978,15 @@ static int check_collision(const char *filename_a, const char *filename_b)
 
         fd_a = open(filename_a, O_RDONLY);
         if (fd_a < 0) {
-                ret = error_errno(_("unable to open %s"), filename_a);
+                if (errno != ENOENT)
+                        ret = error_errno(_("unable to open %s"), filename_a);
                 goto out;
         }
 
         fd_b = open(filename_b, O_RDONLY);
         if (fd_b < 0) {
-                ret = error_errno(_("unable to open %s"), filename_b);
+                if (errno != ENOENT)
+                        ret = error_errno(_("unable to open %s"), filename_b);
                 goto out;
         }
 
One of the tests in t5616 asserts that git-fetch(1) with `--refetch`
triggers repository maintenance with the correct set of arguments. This
test is flaky and causes us to fail sometimes:

  ++ git -c protocol.version=0 -c gc.autoPackLimit=0 -c maintenance.incremental-repack.auto=1234 -C pc1 fetch --refetch origin
  error: unable to open .git/objects/pack/pack-029d08823bd8a8eab510ad6ac75c823cfd3ed31e.pack: No such file or directory
  fatal: unable to rename temporary file to '.git/objects/pack/pack-029d08823bd8a8eab510ad6ac75c823cfd3ed31e.pack'
  fatal: could not finish pack-objects to repack local links
  fatal: index-pack failed
  error: last command exited with $?=128

The error message is quite confusing as it talks about trying to rename
a temporary packfile. A first hunch would thus be that this packfile
gets written by git-fetch(1), but removed by git-maintenance(1) while it
hasn't yet been finalized, which shouldn't ever happen. And indeed, when
looking closer one notices that the file that is supposedly of temporary
nature does not have the typical `tmp_pack_` prefix.

As it turns out, the "unable to rename temporary file" fatal error is a
red herring and the real error is "unable to open". That error is raised
by `check_collision()`, which is called by `finalize_object_file()` when
moving the new packfile into place. Because t5616 re-fetches objects, we
end up with the exact same pack as we already have in the repository. So
when the concurrent git-maintenance(1) process rewrites the preexisting
pack and unlinks it exactly at the point in time where git-fetch(1)
wants to check the old and new packfiles for equality we will see ENOENT
and thus `check_collision()` returns an error, which gets bubbled up by
`finalize_object_file()` and is then handled by `rename_tmp_packfile()`.
That function does not know about the exact root cause of the error and
instead just claims that the rename has failed.

This race is thus caused by b1b8dfde69 (finalize_object_file():
implement collision check, 2024-09-26), where we have newly introduced
the collision check. By definition, two files cannot collide with each
other when one of them has been removed. We can thus trivially fix the
issue by ignoring ENOENT when opening either of the files we're about to
check for collision.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
Hi,

this patch is the follow-up for [1], where I've mentioned a couple of CI
flakes that happen rather regularly. As it turns out, this race was a
real bug hiding in the newly introduced object collision check in case
one of the files got unlinked while performing the check.

Thanks!

Patrick

[1]: <Z2-2dbYVuuLxpNmK@pks.im>
---
 object-file.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
---
base-commit: 306ab352f4e98f6809ce52fc4e5d63fb947d0635
change-id: 20241230-b4-pks-object-file-racy-collision-check-62a8d1588116